Week 1: Genome Analytics
Any tasks that you will be asked to do will be wrapped and styled as this
section. If you are asked to answer any questions or report findings, please do
so in the included docs/week#_tasks.Rmd.
All other text will be directions or other explanatory text that you should read.
Project 0 Overview
This project will focus on getting you accustomed to how Nextflow works and its key functions. You will be constructing a small Nextflow pipeline that will annotate a small bacterial genome using Prokka, calculate some basic genome statistics using Python (GC content and genome size), and extract the sequence for a specified region of the genome by its coordinates. The general format and structure of this repository and Nextflow pipeline will be utilized for all other projects.
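To make the two genome statistics mentioned above concrete, here is a minimal Python sketch of how GC content and genome size could be computed from FASTA text. It is only an illustration under stated assumptions, not the script the project will ask you to write: the function names and the toy record are made up, and every sequence line in the input is treated as part of one genome.

```python
def read_fasta_sequence(lines):
    """Concatenate all sequence lines of a FASTA file, skipping '>' headers."""
    return "".join(
        line.strip() for line in lines if line.strip() and not line.startswith(">")
    ).upper()


def genome_stats(sequence):
    """Return (genome_size, gc_fraction) for an uppercase nucleotide string."""
    size = len(sequence)
    gc = sequence.count("G") + sequence.count("C")
    # Guard against an empty sequence to avoid dividing by zero.
    gc_fraction = gc / size if size else 0.0
    return size, gc_fraction


# Tiny made-up record, used only to illustrate the calculation.
example = [">toy_genome", "ATGCGC", "GGAT"]
size, gc_fraction = genome_stats(read_fasta_sequence(example))
print(f"size={size} bp, GC={gc_fraction:.1%}")
```

Genome size here is simply the number of bases, and GC content is the fraction of those bases that are G or C.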
Objectives
- Understand how to encode environment specifications for conda in YML files
- Gain experience understanding how to run Nextflow and create channels
- Acclimate to the structure and style of projects as workflows
Getting Started
Accept the provided GitHub Classroom link for project 0 on Blackboard and clone the repository to your directory under the bf528 class space (/projectnb/bf528/students/username/).
Please read through the provided README.md, which explains the structure of the directory we are going to be working with throughout the semester.
Setting up our environment
Before we begin working with Nextflow, we will need to install it. For this, we will be using Conda to create an environment where we just have Nextflow (and its direct dependencies) installed. For all project work in this class, you will need to remember to activate this environment as soon as you begin.
Following the practices recommended here, we
will be creating this conda environment directly from a YML file, and this file
will be version controlled and included in the analysis repository. Using
VSCode, navigate to the envs/ directory, open a new file and name it
base_env.yml. Below is an example of what these YML files should look like when
used to create conda environments:
name: name_of_env
channels:
- conda-forge
- bioconda
dependencies:
- desired_package=version_number
A few things to remember:
The order of channels is important! Developers and maintainers of Bioconda (the channel where most bioinformatics packages are hosted) have agreed that conda-forge should be listed first, and bioconda second. This is the channel priority, and conda uses this internally to ensure that it solves dependencies correctly.
Each YML file should specify a single package or tool. Pin the exact version you want to ensure transparency and reproducibility.
Conda will take care of the minor software dependencies, and you generally should not attempt to manually install specific versions of those.
The value after name will be the name of your conda environment and the name you use to activate it and make the packages and software available to you for use.
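The reminders above about channel order and version pinning can be checked mechanically. The following Python sketch is one hedged way to do so. It is not a real YAML parser and only understands the flat layout shown in the example file; the function name check_env_spec and the package some_tool=1.2.3 are placeholders invented for illustration.

```python
def check_env_spec(text):
    """Return a list of problems found in a minimal conda environment YML."""
    section = None
    items = {"channels": [], "dependencies": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("- "):
            # List entry belonging to the most recent section header.
            if section in items:
                items[section].append(line[2:].strip())
        elif line.endswith(":"):
            section = line[:-1]
        else:
            section = None  # a plain "key: value" line such as "name: my_env"

    problems = []
    channels = items["channels"]
    if "conda-forge" in channels and "bioconda" in channels:
        if channels.index("conda-forge") > channels.index("bioconda"):
            problems.append("conda-forge should be listed before bioconda")
    else:
        problems.append("expected both conda-forge and bioconda channels")
    for dep in items["dependencies"]:
        if "=" not in dep:
            problems.append(f"dependency '{dep}' has no pinned version")
    return problems


# Placeholder environment files; some_tool is not a real package.
good = """name: example_env
channels:
- conda-forge
- bioconda
dependencies:
- some_tool=1.2.3
"""
bad = good.replace("- conda-forge\n- bioconda", "- bioconda\n- conda-forge")
bad = bad.replace("=1.2.3", "")

print(check_env_spec(good))
print(check_env_spec(bad))
```

The first call reports no problems, while the second flags both the swapped channel order and the unpinned dependency.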
For our purposes, you will be using a slightly older version of Nextflow to ensure that these directions do not get out of sync.
- Create a YML file in the envs/ directory named base_env.yml and specify that you wish to install Nextflow version 24.04.2. Remember that the value after name in the file will be the name of the conda environment created; you may name this whatever is easiest for you to remember. We will be using this single environment all semester.
- Separately, either by running
conda search -c conda-forge -c bioconda <package-name>
or by navigating to the tool’s website / GitHub, report the most recent version available for Nextflow in the provided notebook.
- Answer the following questions in the provided notebook:
  - What is the advantage of using the most up-to-date versions of software?
  - What are some disadvantages?
- Open a terminal in the top-level of your directory and use the
conda env create -f envs/base_env.yml
command to create the environment described in your YML file. Remember that for all work in this class, you will want to activate this environment as soon as you log into your VSCode session and before you start working. To do this, run the following commands in your VSCode terminal:
module load miniconda
conda activate name_of_nextflow_env
Adding reference files to our directory
If you look in
- Pick one of the three available genomes and copy a single FASTA file
representing an unknown bacterial genome to your refs/ directory.
Generating a samplesheet
For all of our projects, we will be encoding all of the information, including sample metadata and sample filepaths, in a single CSV file. We will then use Nextflow to read the information stored within this sample sheet to drive the workflow and specify what files we want it to operate on.
Encoding the samples in this way offers two key benefits:
We can easily see all of the files associated with a particular analysis
We can generalize our workflow and apply it to a different set of samples by simply supplying a different CSV containing the information for those samples.
Using VSCode, manually make a new text file at the top-level of your directory (project-0-genome-analysis) and generate a simple CSV named samplesheet.csv with two columns named name and path and one row of data which specifies the name of the genome you picked (look at the filename or the first line in the file) and the path to where it is located relative to the top-level directory (refs/your_genome_name.fna).
Next, open the nextflow.config file and, under params, add a value underneath the “//Reads and references” comment that encodes the path to your samplesheet. You may use the Nextflow shorthand $projectDir to refer to the directory containing your main script (here, the top level of the repository).
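Once the samplesheet exists, its layout can be sanity-checked outside of Nextflow. The sketch below is one possible check in Python, assuming the two-column name,path header described above. The function name read_samplesheet is made up for this example, and the row contents are placeholders rather than a real genome.

```python
import csv
import io


def read_samplesheet(text):
    """Parse samplesheet text and return its rows as dictionaries."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != ["name", "path"]:
        raise ValueError(f"expected header name,path but got {reader.fieldnames}")
    rows = list(reader)
    for number, row in enumerate(rows, start=2):  # line 1 is the header
        if not row["name"] or not row["path"]:
            raise ValueError(f"line {number}: name and path must both be filled in")
    return rows


# Placeholder contents; substitute the genome you actually copied into refs/.
example = "name,path\nyour_genome_name,refs/your_genome_name.fna\n"
rows = read_samplesheet(example)
print(rows)
```

To check a real file, pass the result of open("samplesheet.csv").read() instead of the example string. Supplying a different CSV with the same header is all that is needed to describe a different set of samples.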
Creating our first nextflow channel and process
Please refer to the nextflow lecture for more details on this section.
Nextflow channels have a much more technical definition, but you may think of them as a way in which nextflow sends data through a workflow. Channels are connected to processes through the inputs and outputs of the different steps of the workflow.
For our purposes, the nextflow channels we create will often follow a similar pattern of containing a set of files associated with a single sample, and associated metadata. Our channels will contain the file(s) and accessory information needed for whatever process or tool is being used. Often, this accessory information will be sample identifiers, which will be used to name output files or specify options during runtime.
As you can see from the samplesheet you generated, our first nextflow channel will contain two values, the name of the genome and its file path. This will allow us to pass this information to our desired processes for further analysis or manipulation.
To run nextflow, simply make sure that your environment with nextflow installed is activated and then use a terminal to enter the following command in the directory where your nextflow script (main.nf) is located:
nextflow run main.nf -profile local,conda
- Take a look at the nextflow.config and look for the profile section. Answer the following question in the provided notebook:
  - What do you think the option -profile local,conda is doing?
Using VSCode, open the main.nf file and begin by adding our first channel in the workflow declaration. This channel should read the information from the samplesheet and create a tuple containing the values stored under name and path, in that order.
This will require a combination of Nextflow functions and operators including: Channel.fromPath, splitCsv, map, tuple and set. You can also use view to see the contents of a generated channel.
Your channel when observed using view should look something like: [name_of_bacterial_genome, /path/to/reference/genome.fna]
When you have successfully created this channel, you should use the set operator to assign it to a variable. Name this variable fa_ch. set should replace the usage of view, which does not assign your channel to a variable but simply prints its contents to standard out.
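If the shape of the data is hard to picture, the following Python sketch mirrors it conceptually. It is not Nextflow code and does not replace the channel you need to write in main.nf. It only shows the idea of reading rows keyed by the CSV header and mapping each row to a (name, path) tuple, in that order. The function name samplesheet_tuples and the example values are illustrative assumptions.

```python
import csv
import io


def samplesheet_tuples(text):
    """Yield one (name, path) tuple per samplesheet row, in that order."""
    for row in csv.DictReader(io.StringIO(text)):
        # Select the columns by header name so the tuple order is explicit.
        yield (row["name"], row["path"])


# Placeholder row matching the example output shown in the directions.
example = "name,path\nname_of_bacterial_genome,/path/to/reference/genome.fna\n"
fa_items = list(samplesheet_tuples(example))
for item in fa_items:
    print(item)
```

Each yielded tuple plays the role of one item emitted by the channel. With a single-row samplesheet there is exactly one item.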
Week 1 Tasks Summary
Cloned the github classroom link for project 0 to your student directory in the bf528 project space.
Read through the README.md, which describes the directory structure we will be utilizing throughout the semester.
Generated a YML file in the envs/ directory that contains the specification for your base environment with Nextflow version 24.04.2 installed.
Copied a single genome fna (FASTA) to your refs/ directory.
Made the samplesheet.csv file at the top-level of your directory containing the columns name and path with one row being the name and path for your single genome file.
Edited the nextflow.config file and added a param containing the path to your samplesheet.
Generated your first nextflow channel, fa_ch, containing a tuple with the name and path stored in your samplesheet.