Week 1: Genome Analytics
Any tasks that you will be asked to do will be wrapped and styled as this
section. Answer any questions found in the associated docs/week#_tasks.ipynb
.
All other text will be directions or other explanatory text that you should read.
Section links
Adding reference files to our directory
Project 0 Overview
This project will focus on getting you accustomed to how nextflow works and its key functions. You will be constructing a nextflow pipeline that will annotate a small bacterial genome using Prokka, calculate some basic genome statistics using python (GC Content, and genome size) and extract the sequence for a specified region of the genome by its coordinates. The general format and structure of this repository and nextflow pipeline will be utilized for all other projects.
Objectives
Understand how to encode environment specifications for conda in YML files
Gain experience understanding how to run Nextflow and create channels
Acclimate to the structure and style of projects as workflows
Getting Started
Accept the provided github classroom link for project 0 on blackboard and clone it to your directory under the bf528 class space (/projectnb/bf528/students/username/)
Please read through the provided README.md, which explains the structure of the directory we are going to be working with throughout the semester.
From now on, you should always be exclusively be working in an interactive VScode session. You can request sessions up to 10 days in duration but I usually suggest you request 24 hr sessions. VScode offers a number of benefits:
a. It has a built-in terminal and file explorer / editor
b. The terminal is persistent for as long as your session
c. You can install various extensions like syntax highlighters
d. The VScode session runs on a compute node, and not the head node
e. We can submit jobs via qsub directly from this session without being on
a head node
Setting up our environment
Before we begin working with Nextflow, we will need to install it. For this, we will be using Conda to create an environment where we just have Nextflow (and its direct dependencies) installed. For all project work in this class, you will need to remember to activate this environment as soon as you begin.
To start, we are going to use the module with miniconda installed on the SCC. We are only going to create a single conda environment that contains Nextflow and a testing package. Our worklows will use their own self-contained environments or containers. Whenever you begin work for this course on the SCC, you should make sure to first load the miniconda module using the following command:
module load miniconda
Following the practices recommended here, we
will create our conda environment directly from a version controlled YML
file. Using VSCode, navigate to the envs/
directory, open a new file and name it
base_env.yml
. Below see an example of what these YML files should look like
when used to create conda environments:
name: name_of_env
channels:
- conda-forge
- bioconda
dependencies:
- desired_package=version_number
A few things to remember:
The order of channels is important! Developers and maintainers of Bioconda (the channel where most bioinformatics packages are hosted) have agreed that conda-forge should be listed first, and bioconda second. This is the channel priority, and conda uses this internally to ensure that it solves dependencies correctly.
You should specify a single package or tool in separate YML files. You should directly pin the exact version you desire to ensure transparency and reproducibility.
Conda will take care of the minor software dependencies, and you generally do not attempt to manually install specific versions of those.
The value after name will be the name of your conda environment and the name you use to activate it and make the packages and software available to you for use.
Create a YML file in the
envs/
directory namedbase_env.yml
and specify that you wish to installnextflow
version 24.04.2 andnf-test
version 0.8.4. Remember that the value after name in the file will be the name of the conda environment created, you may name this whatever is easiest for you to remember. We will be using this single environment all semester.-
Open a terminal in the top-level of your directory, use the
conda env create -f envs/base_env.yml
command to create the environment described in your YML file. Remember that for all work in this class, you will want to activate this environment as soon as you log into your VSCode session before you start working. To do this, you will have to remember to perform the following steps in your VScode terminal:module load miniconda
conda activate name_of_nextflow_env
Adding reference files to our directory
If you look in /projectnb/bf528/materials/project-0-genome-analytics, you have been provided with your choice of three (as of now) unidentified genomes.
Navigate to /projectnb/bf528/materials/project-0-genome-analytics
Pick one of the three available genomes and copy a single FASTA file representing an unknown bacterial genome to your
refs/
directory.
Generating a samplesheet
For all of our projects, we will be encoding all of the information including sample metadata, and sample filepaths in a single CSV file. We will then use Nextflow to read the information stored within this sample sheet to drive the workflow and specify what files we want it to operate on.
This pattern of encoding the samples in this way offers two key benefits:
We can easily see all of the files associated with a particular analysis
We can generalize our workflow and apply it to a different set of samples by simply supplying a different CSV containing the information for those samples in the same format.
- Using VScode, manually make a new text file at the top-level of your
directory (project-0-genome-analytics-your_username) and generate a simple CSV
named
samplesheet.csv
. The columns in this CSV should be namedname
andpath
and look something like below:
name,path
genomeA,refs/genomeA.fna
- Next open the
nextflow.config
file, and underparams
add a value calledsamplesheet
underneath the “//Reads and references” comment that encodes the path to your samplesheet.
samplesheet = "$projectDir/samplesheet.csv"
The $projectDir is a nextflow variable that always encodes the working directory of the workflow. This allows you to avoid hardcoding a system-specific path.
Creating our first nextflow channel and process
Please refer to the nextflow lecture for more details on this section.
Nextflow channels have a much more technical definition, but you may think of them as lists of information that nextflow uses to send data through a workflow. Channels are connected to processes through the inputs and outputs of the different steps of the workflow.
For our purposes, the nextflow channels we create will often follow a similar pattern of containing a set of files associated with a single sample, and associated metadata. Our channels will contain the file(s) and accessory information needed for whatever process or tool is being used. Often, this accessory information will be sample identifiers, which will be used to name output files or specify options during runtime.
As you can see from the samplesheet you generated, our first nextflow channel will contain two values, the name of the genome and its file path. This will allow us to pass this information to our desired processes for further analysis or manipulation.
We create these channels in the workflow main.nf
and they specify in Nextflow
which processes should be run and in what order. We will be creating our initial
channel from the information contained within the samplesheet we just created.
This first channel will contain as many values as rows present in the CSV and
it will contain the information needed to run our first process. For this first
process, this information is a name, and the path to the file that we are going
to operate on.
In the
main.nf
, create your first channel using a combination ofChannel.fromPath
,splitCsv
,map
,tuple
,params.samplesheet
andview()
. The goal is to create a channel that contains the information contained within the CSV.When you believe you have successfully constructed a channel containing the information from the samplesheet in the same format as above, you may run nextflow to test your channel output using the following command from a terminal at the top-level of your directory:
nextflow run main.nf -profile local,conda
This will run nextflow and it will print the contents of the channel you just created to standard out. Make sure your channel is structured as above where the first element of the tuple is the value encoded in the name column and the second element contains the value in the path column from the CSV.
Your channel when observed using view
should print to the console and look
like the following:
[name_of_bacterial_genome, /path/to/reference/genome.fna]
- When you have successfully created this channel, you should use the
set
function to assign it to a variable. Name this variablefa_ch
.
set
should replace the usage of view
, which does not assign your channel to
a variable but simply prints its contents to standard out. They are mutually
exclusive.
Week 1 Detailed Tasks Summary
Clone the github classroom link for project 0 to your student directory in the bf528 project space.
Read through the README.md, which describes the directory structure we will be utilizing throughout the semester.
Generate a YML file in the
envs/
directory that contains the specification for your base environment with Nextflow version 24.04.2 installed.Copy a single genome fna (FASTA) to your
refs/
directory.Make a
samplesheet.csv
file at the top-level of your directory containing the columnsname
andpath
with one row being the name and path for your single genome file.Edit the
nextflow.config
file and add aparam
calledsamplesheet
containing the path to your samplesheetGenerate your first nextflow channel,
fa_ch
, containing a tuple with the name and path values from yoursamplesheet.csv
Answer all of the questions in the provided jupyter notebook