Week 3: RNAseq
Week 3 Overview
By now, your pipeline should execute all of the necessary steps to perform sample quality control, alignment, and quantification. This week, we will focus on re-running the pipeline with the full data files and beginning a basic differential expression analysis.
Objectives
Re-run your working pipeline on the full data files
Evaluate the QC metrics for the original samples
Choose a filtering strategy for your raw counts matrix
Perform basic differential expression on your data using DESeq2
Generate a sample-to-sample distance plot and PCA plot for your experiment
Docker images for your pipeline
FastQC: ghcr.io/bf528/fastqc:latest
MultiQC: ghcr.io/bf528/multiqc:latest
VERSE: ghcr.io/bf528/verse:latest
STAR: ghcr.io/bf528/star:latest
Pandas: ghcr.io/bf528/pandas:latest
Switching to the full data
Once you’ve confirmed that your pipeline works end-to-end on the subsampled files, apply the workflow to the original samples. This requires only a few changes:
- Create a new directory in samples/ called full_files/.
- Copy the original files from /projectnb/bf528/materials/project_1_rnaseq/full_files/ to your newly created directory.
- Edit your nextflow.config and change the path found in your params.reads to reflect the location of your full files.
Most of our bioinformatics experiments will involve relatively large files and expensive operations. Even relatively small RNAseq experiments involve aligning tens of millions of reads to genomes that are many megabases long. Now that we are working with these larger files, we should not, and in some cases will not be able to, run these tasks locally on the same node where our VSCode session is running.
We will now switch to running our jobs on the cluster, using the qsub utility to queue jobs to run on compute nodes. This will enable us both to request nodes that have faster processors or more RAM and to easily parallelize tasks that can run simultaneously.
To run your workflow on the cluster, switch to using the cluster profile option in place of the local one. Your new nextflow command should now be:
nextflow run main.nf -profile singularity,cluster
You may examine the progress and status of your jobs by using the qstat utility, as discussed in lecture and lab.
Evaluate the QC metrics for the full data
After your pipeline has finished, inspect the MultiQC report generated from the full samples.
- In your provided notebook, comment on the general quality of the sequencing reads. Write a paragraph in the style of a paper reporting what you find and any metrics that might be concerning.
Filtering the counts matrix
We will typically filter our counts matrices to remove genes that we believe will be uninformative for the DE analysis. It is important to remember that filtering is subjective and meant to reduce our statistical testing burden.
- Choose a filtering strategy and apply it to your counts matrix. In the provided notebook, report the strategy you used and create a plot that demonstrates the effects of your filtering on the counts for all of your samples.
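As one illustration of what a filtering strategy can look like, the sketch below keeps only genes that reach a minimum count in a minimum number of samples. This is a minimal example and not a recommended strategy: the gene names, counts, function name, and both thresholds (`min_count`, `min_samples`) are hypothetical, and plain Python dictionaries stand in for the counts matrix you would normally load from your pipeline output.

```python
# Hypothetical toy counts: gene ID -> raw counts per sample (values are made up).
counts = {
    "geneA": [0, 0, 0, 0, 0, 0],
    "geneB": [5, 0, 12, 0, 3, 0],
    "geneC": [150, 210, 98, 175, 160, 300],
    "geneD": [0, 1, 0, 0, 2, 0],
}


def filter_counts(counts, min_count=10, min_samples=2):
    """Keep genes with at least `min_count` reads in at least `min_samples` samples.

    Both thresholds are illustrative assumptions, not course requirements.
    """
    kept = {}
    for gene, values in counts.items():
        # Count how many samples meet the per-sample minimum for this gene.
        n_passing = sum(1 for v in values if v >= min_count)
        if n_passing >= min_samples:
            kept[gene] = values
    return kept


filtered = filter_counts(counts)
print(f"{len(counts)} genes before filtering, {len(filtered)} after")
print("Genes kept:", sorted(filtered))
```

Comparing the number of genes, or the distribution of counts per sample, before and after a filter like this is one way to build the plot requested above.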
Performing differential expression analysis using the filtered counts
Refer to the DESeq2 vignette for how to perform a basic differential expression analysis. For this dataset, you will simply be testing for differences between conditions (control vs. experimental). Choose an appropriate padj threshold to generate a list of statistically significant differentially expressed genes from your analysis.
Perform a basic differential expression analysis and produce the following as well-formatted figures:
A table containing the DESeq2 results for the top ten significant genes ranked by padj
The results from a DAVID or ENRICHR analysis on the significant genes at your chosen padj threshold.
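Once you have a results table, selecting significant genes and ranking them by padj is a small filtering and sorting step. The sketch below shows that logic on made-up rows shaped like DESeq2 output. It does not run DESeq2 itself; the gene names, values, function name, and the 0.05 threshold are all assumptions for illustration. DESeq2 reports padj as NA for some genes, so the example drops those rows before applying the threshold.

```python
import math

# Hypothetical rows mimicking DESeq2 results columns (all values are made up).
results = [
    {"gene": "geneA", "log2FoldChange": 2.1, "padj": 0.0004},
    {"gene": "geneB", "log2FoldChange": -0.3, "padj": 0.62},
    {"gene": "geneC", "log2FoldChange": -1.8, "padj": 0.012},
    {"gene": "geneD", "log2FoldChange": 0.9, "padj": float("nan")},  # NA padj
    {"gene": "geneE", "log2FoldChange": 1.2, "padj": 0.049},
]


def significant_genes(rows, padj_threshold=0.05, top_n=10):
    """Return up to `top_n` rows with padj below the threshold, smallest padj first."""
    # Rows with a missing (NaN) padj cannot be called significant, so drop them.
    tested = [r for r in rows if not math.isnan(r["padj"])]
    passing = [r for r in tested if r["padj"] < padj_threshold]
    return sorted(passing, key=lambda r: r["padj"])[:top_n]


top = significant_genes(results)
for row in top:
    print(f'{row["gene"]}\t{row["log2FoldChange"]:+.2f}\t{row["padj"]:.4g}')
```

The gene identifiers from the full list that passes your threshold (not only the top ten) are what you would submit to DAVID or ENRICHR.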
RNAseq Quality Control Plots
It is common to produce both a PCA plot and a sample-to-sample distance matrix from our counts to help us judge whether the differences we see in the differential expression analysis can reasonably be attributed to our biological condition of interest.
Choose an appropriate normalization strategy (rlog or vst) and generate a normalized counts matrix for the experiment
Perform PCA on this normalized counts matrix and overlay the sample information in a biplot of PC1 vs. PC2
Create a heatmap or graphic of the sample-to-sample distances for the experiment
In the provided notebook, comment in no fewer than two paragraphs on your interpretation of these plots and whether you believe the DE analysis was successful.
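To make the idea of a sample-to-sample distance matrix concrete, the sketch below computes pairwise Euclidean distances between samples on transformed counts. It is only a rough illustration: the sample names and counts are made up, and a simple log2(count + 1) transform stands in for rlog or vst, which are DESeq2 functions and are not equivalent to it.

```python
import math

# Hypothetical filtered counts: sample name -> counts for the same ordered set of genes.
samples = {
    "control_rep1": [150, 20, 0, 980],
    "control_rep2": [160, 25, 1, 1010],
    "exp_rep1": [40, 300, 5, 950],
    "exp_rep2": [35, 280, 3, 990],
}


def log_transform(values):
    """Simple log2(count + 1) transform; a stand-in for rlog/vst, not a replacement."""
    return [math.log2(v + 1) for v in values]


def distance_matrix(samples):
    """Euclidean distance between every pair of samples on the transformed counts."""
    transformed = {name: log_transform(vals) for name, vals in samples.items()}
    names = list(transformed)
    return {
        a: {b: math.dist(transformed[a], transformed[b]) for b in names}
        for a in names
    }


dist = distance_matrix(samples)
for a, row in dist.items():
    print(a, " ".join(f"{d:6.2f}" for d in row.values()))
```

In a well-behaved experiment, replicates of the same condition tend to sit closer to each other than to samples from the other condition, which is the pattern a heatmap of this matrix makes visible.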