Week 2: RNAseq
Week 2 Overview
Now that we have performed basic quality control on the FASTQ files, we are going to map them to the human reference genome to generate alignments for each of our sequencing reads. After alignment, we will aggregate the outputs from FASTQC and STAR into a single report summarizing some of the important quality control metrics describing our sequencing reads and the alignments. We will then quantify the alignments in our BAM file to the gene-level using VERSE.
Objectives
Align your sequencing reads to the human reference genome using STAR
Use MultiQC to generate a single report containing the quality metrics for the sequencing reads and alignments
Generate gene-level counts using VERSE for each of the samples
Concatenate gene-level counts from each sample into a single counts matrix
Docker images for your pipeline
FastQC: ghcr.io/bf528/fastqc:latest
multiQC: ghcr.io/bf528/multiqc:latest
VERSE: ghcr.io/bf528/verse:latest
STAR: ghcr.io/bf528/star:latest
Pandas: ghcr.io/bf528/pandas:latest
Aligning reads to the genome
Last week you generated a STAR index to enable alignment of reads to the human reference genome. This week, you will use this index to align the sequencing reads to the genome.
Remember that paired end reads are almost always used in conjunction with each other (R1 and R2) and that they collectively represent the reads from a single sample. When we align both of these paired end reads to the genome, we will generate a single set of all valid alignments for the sample.
By default, many alignment programs will output these alignments in SAM format. As discussed in lecture, the BAM format is a compressed version of SAM files that contains the same information. Oftentimes, we will simply choose to generate BAM files in place of SAM files in order to preserve disk space.
- Look at the documentation for STAR and focus on section 3 (pg. 7) for how to use STAR to run a basic mapping job. Construct a working nextflow module that performs basic alignment using STAR.
Your STAR command should include only the following options and all others may be left at their default value:
--runThreadN
, --genomeDir
, --readFilesIn
, --readFilesCommand
,
--outFileNamePrefix
, --outSAMtype
- Remember that by default, nextflow stores all of the outputs for a specific task in the staged directory in which it ran. Often, we will want to inspect the output files or log files from various processes.
Ensure that your STAR module has two separate named outputs using emit
. The two
outputs will be the BAM file and another for the log file generated during
alignment named with the extension .Log.final.out
.
- Ensure that you use the
process_high
for thelabel
and the appropriate containerghcr.io/bf528/star:latest
.
The log file from STAR will allow us to collect certain statistics about the alignment rates that are useful for quality control purposes. As a general rule of thumb, if there were no obvious issues with the sequencing preparation or errors in the alignment, we expect a substantial proportion of our reads to align to the reference genome. For well-annotated and studied genomes like human or mouse, we usually see alignment rates >70-80% for successful NGS experiments. Lower alignment rates are often expected for genomes that have not been sequenced to the same quality and depth as the more commonly used references. Make sure to evaluate these alignment rates in an experiment-specific context as there is no set threshold or cutoff that is appropriate for all cases.
Performing post-alignment QC
Typically after performing alignment, it is good to obtain a few post-alignment quality control metrics to quickly check if there appear to be any major problems with the data. At this step, we will typically evaluate the quality of the reads themselves (PHRED scores, contamination, etc.) along with the alignment rate to the reference genome.
As we’ve discussed, in larger experiments, it will quickly become cumbersome or unfeasible to manually inspect the results for all of our samples individually. Additionally, if we only look at one sample at a time, we may miss larger trends or biases across all of our samples. To solve this issue, we will be making use of MultiQC, which is a tool that simply aggregates the relevant logs and outputs from various bioinformatics utilities into a nicely formatted HTML report.
Since you are working with files that have been intentionally filtered to make them smaller, the actual outputs from fastQC and STAR will be misleading. Do not draw any conclusions from these reports generated on the subsetted data; the results will only be meaningful when you’ve switched to running this pipeline on the full dataset.
- Make a new module that will run
MultiQC. You can specify the label as
process_low
and setpublishDir
to yourresults/
directory. We will take advantage of the staging directory strategy that nextflow uses to run MultiQC.
By default, MultiQC will simply scan a directory and automatically detect any
of the common output files and logs created by the bioinformatics tools it
supports. For your input
, you can simply specify path('*')
and it creates
an HTML file as an output
.
- The tricky part with running MultiQC and Nextflow is that you will need to gather all of the output files from FASTQC and STAR and ensure that multiqc only runs after all of the samples have been processed by both of these tools.
Use a combination of map()
, collect()
, mix()
, flatten()
to create a
single channel that contains a list with all of the output files from FASTQC and
STAR logs for every sample and call it multiqc_ch
. Remember that you may access
the outputs of a previous process by using the .out()
notation (i.e. ALIGN.out
or FASTQC.out.zip).
See below for an example of what the channel should look like:
multiqc_channel
[sample1_R1_fastqc.zip, sample1_R2_fastqc.zip, sample1.Log.final.out,
sample2_R1_fastqc.zip, sample2_R2_fastqc.zip, sample2.Log.final.out, ...]
Add the MultiQC module to your workflow
main.nf
and run MultiQC. MultiQC should run a single time and only after every alignment and fastqc process has finished.Ensure that your
multiqc_report.html
is successfully created and contains the QC information from both FASTQC and STAR for all of your samples. You may open the HTML file through SCC ondemand.Make sure to include the
-f
flag in your multiqc command.
Quantifying alignments to the genome
In RNAseq, we are interested in quantifying gene expression and comparing that expression across conditions. We have so far generated alignments from the reads from all of our samples to their appropriate reference genome. We will use the information contained within the GTF (what each region of the genome represents) to assign these alignments to features and count them.
For differential expression analysis, our feature of interest will be exons as those are the regions of genes that largely comprise the sequences found in mRNA (which is what we are measuring and what was originally sequenced). We will generate a single count for every gene representing the sum of the union of all alignments falling into every exon annotated to that gene. This gene-level count will be used as a proxy for that gene’s expression in a particular sample.
We will be using VERSE, which is a read counting tool that will quantify alignments into counts based on a feature of interest. VERSE also has built-in strategies for assigning counts hierarchically in the case of overlapping features.
Generate a module that runs VERSE on each of your BAM files. You may leave all options at their default parameters. Be sure to include the
-S
flag in your final command.Run the VERSE module in your workflow
main.nf
and quantify the alignments in each of the BAM files
Concatenating count outputs into a single matrix
After VERSE has run successfully, you will have generated a single set of counts for each of your samples. To perform differential expression analysis, we will need to combine count outputs from each sample into a single file where the rows are the genes and the columns are the sample counts.
Write a python script that will concatenate all of the verse output files and write a single counts matrix containing all of your samples. As with any external script, make it executable with a proper shebang line and use argparse to allow the incorporation of command line arguments. I suggest you use
pandas
for this task and you can use the pandas containerghcr.io/bf528/pandas:latest
.Generate a module that runs this script and create a channel in your workflow
main.nf
that consists of all of the VERSE outputs. Incorporate this script into your workflow and Ensure that this module / script only executes after all of the VERSE tasks have finished.
Week 2 Detailed Task Summary
- Generate a module that runs STAR to align reads to a reference genome
- Ensure that you output the alignments in BAM format
- Use all default parameters
- Specify the log file with extension (.Log.final.out) as a nextflow output
Make a module that runs MultiQC using a channel that contains all of the FASTQC outputs and all of the STAR output log files
Create a module that runs VERSE on all of your output BAM files to generate gene-level counts for all of your samples
Write a python script that uses
pandas
to concatenate all of the VERSE outputs into a single counts matrix. Generate an accompanying nextflow module that runs this python script