Week 1: RNAseq
Section Links
Create a working directory for project 1
Changes to our environment management strategy and workflow
Generating our input channels for nextflow
Generate a file containing the gene IDs and their corresponding human gene symbols
Week 1 Overview
A basic RNAseq analysis consists of sample quality control, alignment, quantification and differential expression analysis. This week, we will be performing quality control analysis on the sequencing reads, generating a genome index for alignment, and making a mapping of human ensembl IDs to gene names.
Objectives
Perform basic quality control by running FASTQC on the sequencing reads
Generate a delimited file containing the mapping of human ensembl ids to gene names
Use STAR to create a genome index for the human reference genome
Create a working directory for project 1
Accept the github classroom link and clone the assignment to your student directory in /projectnb/bf528/students/your_username/. This link will be posted on the blackboard site for our class.
Locating the data
For this first project, we have provided you with small subsets of the original
data files. You should copy these files to the samples/
folder in your working
directory that you cloned. For your convenience, we have renamed the files to be
descriptive of the samples they represent following the pattern:
{condition}{rep}{readpair}.subset.fastq.gz (i.e. control_rep1_R1.subset.fastq.gz)
The files are located here: /projectnb/bf528/materials/project-1-rnaseq/subsampled_files and there are 12 total files. Remember that paired end reads are typically used together as both the _R1 and _R2 files represent sequencing reads from the same fragments. This means that we have 6 biological samples, and 12 actual files representing those samples.
Create a new directory in
samples/
calledsubsampled_files/
Using
cp
, make all of these files available in yoursubsampled_files/
directoryIn this same directory, /projectnb/bf528/materials/project-1-rnaseq/, you will also find a refs directory. Copy the files contained within to your
refs/
directory. Encode their paths in separate params in yournextflow.config
.
Changes to our environment management strategy and workflow
As we’ve discussed, conda environments are one solution to ensuring that your analyses are run in a reproducible and portable manner. Containers are an alternative technology that have a number of advantages over conda environments. Going forward, we will be incorporating containers specifying our computational environments into our pipeline.
One of the advantages of container technologies is that they usually offer a robust ecosystem of shared containers (images) that are available in public repositories for anyone to reuse. We have developed a set of containers for each piece of software that we will be using in this course.
For all future process scripts that you develop in nextflow, instead of
specifying the conda
environment to execute the task with, you will instead
specify container
and the image (a container built with a certain specification)
location. For example:
process FASTQC {
container 'ghcr.io/bf528/fastqc:latest'
...
}
For this course, these containers were all pre-built and their specifications kept in this public github repository: https://github.com/BF528/pipeline_containers. Feel free to look into the repo for how the containers were built. We follow a similar pattern where we specify the exact environment we would like created in a YAML file and then generate the environment using micromamba installed in a container. We will get some experience later in the semester with building your own containers from scratch.
In general, the containers will be named following the same pattern:
ghcr.io/bf528/<name-of-tool>:latest
or ghcr.io/bf528/fastqc:latest
.
New command for running your workflow
You will need to incorporate one more change to make your pipeline use these
containers. If you look in the nextflow.config
, you’ll see a profile labeled
singularity, which encodes some singularity options for nextflow to use.
When running your pipeline, you will now use the -profile singularity,local
option, which will have Nextflow execute tasks with the specified container image
listed in the module. The config contains some common options that nextflow will
automatically add to the singularity command when run with the specified profile.
nextflow run main.nf -profile singularity,local
As always, remember to activate your conda environment containing nextflow and nf-test before doing any work.
Docker images for your pipeline
FastQC: ghcr.io/bf528/fastqc:latest
multiQC: ghcr.io/bf528/multiqc:latest
VERSE: ghcr.io/bf528/verse:latest
STAR: ghcr.io/bf528/star:latest
Pandas: ghcr.io/bf528/pandas:latest
Generating our input channels for nextflow
This RNAseq pipeline will be driven by two channels that contain the starting
FASTQ files for each of the samples in the experiment. Nextflow has a built-in
function to simplify the generation of channels for samples from paired-end
sequencing experiments. The Channel.fromFilePairs
channel function allows you
to detect paired end fastq files for each sample using similar pattern matching
in bash through the use of wildcard expansion (*).
In the
nextflow.config
, specify a parameter calledreads
that encodes the path to your fastq files and uses * to flexibly detect the sample name associated with both paired files. Refer to the nextflow documentation hereChange explanation of where to encode path with *
In your workflow
main.nf
, use theChannel.fromFilePairs
function and the param you created in step 1 to create a channel calledalign_ch
. You’ll notice that this function creates a channel with a structure we’ve seen before: a tuple containing the base name of the file and a list containing the R1 and R2 file associated with that sample.
align_ch
[sample1, [sample1_R1.fastq.gz, sample1_R2.fastq.gz]]
- In your workflow
main.nf
create another channel using the exact logic from above but add an additional operation to create a channel that has as many elements in the channel as there are actual files (16). You may find information on common nextflow operators that will enable this here
Name this channel fastqc_channel
and it should be a list tuples where the
first value is the name of the sample and the second is the path to the
associated file. This will look something like below:
fastqc_channel
[sample1, sample1_R1.fastq.gz]
[sample1, sample1_R2.fastq.gz]
Performing Quality Control
At this point, you should have:
Setup a directory for this project by accepting the github classroom link
Copied the subsampled files to your working directory,
samples/
Familiarized yourself with the changes to environment management and how to run nextflow using containers
Generate two channels in your
main.nf
with the size and elements specified above
We will begin by performing quality control on the FASTQ files generated from the experiment. fastQC is a bioinformatics software tool that calculates and generates descriptive graphics of the various quality metrics encoded in a FASTQ file. We will use this tool to quickly check the basic quality of the sequencing in this experiment.
- In your envs/ directory, make a YML specification file to create a conda environment with the latest version of FASTQC installed. Use the same command as last week to generate a new environment with just fastqc installed:
conda env create -f envs/test_env.yml
- Activate this new environment after it’s created, and navigate to the
temp/
directory. You can view the help information for FASTQC by calling the program in a terminal withfastqc -h
or for some programsfastqc --help
. Try to run FASTQC on the provided fastq file.
By default, FASTQC will create two output files from a single input named as seen below:
fastqc sample1.fastq.gz
Outputs:
sample1_fastqc.html
sample1_fastqc.zip
- Make a new process for fastqc in the
modules/
directory. Be sure to specify the following in your module:
label 'process_low'
container 'ghcr.io/bf528/fastqc:latest'
publishDir <a param containing the path to results/>
- Specify the inputs of the process to match the structure of the
fastqc_channel
we just generated.For this module, please list two named outputs, which will later allow us to use the individual outputs separately:
output:
tuple val(name), path('*.zip'), emit: zip
tuple val(name), path('*.html'), emit: html
This output definition will instruct nextflow that both of these files should exist after FASTQC has run successfully. It will capture the html report and zip file separately and allow you to pass these different values through channels separately (i.e. FASTQC.out.zip would refer specifically to the zip file created by the FASTQC task)
Remember you can use wildcard expansions in the path output to flexibly detect files with certain extensions without specifying their full filename.
- The shell command should be the successful fastqc command you ran earlier. Ensure that the following argument is included in your actual fastQC command:
-t $task.cpus
- Incorporate the FASTQC process into your workflow
main.nf
and provide it the proper channel.
Generate a file containing the gene IDs and their corresponding human gene symbols
As we’ve discussed, it’s often more intuitive for us to use gene names rather than their IDs. While there are external services that can perform this for us (biomaRt), it is often better to extract this information from the GTF file associated with the exact version of the reference genome we are using. This will ensure that we are as consistent as possible with our labeling.
Whenever we need to perform operations using custom code, we are going to use
the conventions established in project 0. We will place this script in the bin/
directory and make it executable. We will then create a nextflow module that will
provide the appropriate command line arguments to the script.
Generate a python script that parses the GTF file you were provided and creates a delimited file containing the ensembl mouse ID and its corresponding gene name. Please copy and modify the
argparse
code used in the project 0 script to allow the specification of command line arguments. The script should take the GTF as input and output a single text file containing the requested information.Create a nextflow module that calls this script and provides the appropriate command line arguments necessary to run it.
You may use the biopython (ghcr.io/bf528/biopython:latest) or the pandas (ghcr.io/bf528/pandas:latest) container to run this task as both of these contain a python installation.
Incorporate this module to parse the GTF into your workflow
main.nf
and pass it the appropriate GTF input encoded as a param.
0.6 Creating your own process labels
As we discussed in class, you can request various ranges of computational resources
for your jobs / processes to use through the qsub
command and its accompanying
options.
Please refer to the following page for common combinations of options to request specific amounts of resources from nodes on the SCC.
By now you have noticed that some modules have a
label
and you can see in thenextflow.config
in the profiles section the exact qsub command and options each label requests.Create two new labels named
process_medium
andprocess_high
. Forprocess_medium
, use the correct options to request 8 cores and <= 32GB ram Forprocess_high
, set the options to request 16 cores and <= 128GB ram.
You will need to add a clusterOptions line to the process_high
to specify the
additional flags.
Generate a genome index
Most alignment algorithms require an index to be generated to make the alignment process more efficient and expedient. These index files are both specific to the tool and the reference genome used to build them. We will be using the STAR aligner, one of the most commonly used alignment tools for RNAseq.
Read the beginning of the documentation for how to generate a STAR index.
Generate a module and process that creates a STAR index for our reference genome. This process will require two inputs, the reference genome and the associated GTF file. Ensure that the following are specified in your module:
label 'process_high'
container 'ghcr.io/bf528/star:latest'
- The outputs of this process will be a directory of multiple files that all comprise the index. You will have to create this directory prior to running the STAR command. Use all of the basic options specified in the manual and leave them at their default values. Be sure to also include the following argument in your STAR command:
--runThreadN $task.cpus
- Incorporate this process into your workflow and pass it the appropriate inputs from your params encoding the path to the reference genome fasta and GTF file.
Week 1 Tasks Summary
Clone the github classroom link for this project
Copy the necessary files to your working directory. These files will include the following:
- Gencode hg38 reference genome for the primary assembly
- Gencode GTF for the primary assembly
- The subsampled FASTQ files
Generate a nextflow channel called
align_ch
that has 8 total elements where each element is a tuple containing two elements: the name of the sample, and a list of both of the associated paired end files.Generate a nextflow channel called
fastqc_channel
that has 16 total elements where each element is a tuple containing two elements: the name of the sample, and one of the FASTQ files.Generate a module that successfully runs FASTQC using the
fastqc_channel
Develop an external script that parses the GTF and writes a delimited file where one column represents the ensembl mouse IDs and the value in the other column is the associated MGI symbol.
Generate a module that successfully creates a STAR index using the params containing the path to your reference genome assembly and GTF