Week 3: RNAseq

Week 3 Overview

By now, your pipeline should execute all of the necessary steps to perform sample quality control, alignment, and quantification. This week, we will focus on re-running the pipeline with the full data files and beginning a basic differential expression analysis.

Objectives

  • Re-run your working pipeline on the full data files

  • Evaluate the QC metrics for the original samples

  • Choose a filtering strategy for your raw counts matrix

  • Perform basic differential expression on your data using DESeq2

  • Generate a sample-to-sample distance plot and PCA plot for your experiment

Docker images for your pipeline

FastQC: ghcr.io/bf528/fastqc:latest

MultiQC: ghcr.io/bf528/multiqc:latest

VERSE: ghcr.io/bf528/verse:latest

STAR: ghcr.io/bf528/star:latest

Pandas: ghcr.io/bf528/pandas:latest

Switching to the full data

Once you’ve confirmed that your pipeline works end-to-end on the subsampled files, apply the workflow to the original samples. This requires only a few changes.

  1. Create a new directory in samples/ called full_files/. Copy the original files from /projectnb/bf528/materials/project_1_rnaseq/full_files/ to your newly created directory.

  2. Edit your nextflow.config and change the path found in your params.reads to reflect the location of your full files.

Most of our bioinformatics experiments will involve relatively large files and expensive operations. Even relatively small RNAseq experiments involve aligning tens of millions of reads to genomes that are many megabases long. Now that we are working with these larger files, we should not run these tasks locally on the same node that our VSCode session is running on, and in some cases we will not be able to.

We will now switch to running our jobs on the cluster, using the qsub utility to queue jobs on compute nodes. This lets us request nodes with faster processors or more RAM, and lets tasks that do not depend on each other run in parallel.

To run your workflow on the cluster, use the cluster profile option in place of the local one. Your new nextflow command should now be:

nextflow run main.nf -profile singularity,cluster

You may examine the progress and status of your jobs by using the qstat utility as discussed in lecture and lab.

Evaluate the QC metrics for the full data

After your pipeline has finished, inspect the MultiQC report generated from the full samples.

  1. In your provided notebook, comment on the general quality of the sequencing reads. Write a paragraph in the style of a paper reporting what you find and any metrics that might be concerning.

Choosing a filtering strategy for your raw counts

Filtering the counts matrix

We will typically filter our counts matrices to remove genes that we believe will be uninformative for the DE analysis. It is important to remember that filtering is subjective and meant to reduce our statistical testing burden.

  1. Choose a filtering strategy and apply it to your counts matrix. In the provided notebook, report the strategy you used and create a plot that demonstrates the effects of your filtering on the counts for all of your samples.
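One common style of filter keeps a gene only if it reaches a minimum count in a minimum number of samples. The sketch below illustrates that idea using only the Python standard library. The function name, the thresholds (10 reads in at least 2 samples), and the toy counts are all hypothetical and are not a recommended strategy for this dataset; you are still expected to choose and justify your own.

```python
def filter_counts(counts, min_count=10, min_samples=2):
    """Keep genes with at least `min_count` reads in at least `min_samples` samples.

    `counts` maps a gene ID to its list of raw counts, one value per sample.
    Both thresholds are illustrative assumptions, not course requirements.
    """
    kept = {}
    for gene, row in counts.items():
        n_passing = sum(1 for c in row if c >= min_count)
        if n_passing >= min_samples:
            kept[gene] = row
    return kept


# Hypothetical toy matrix: gene ID -> raw counts across four samples.
counts = {
    "geneA": [0, 0, 1, 0],
    "geneB": [25, 40, 31, 18],
    "geneC": [12, 0, 15, 3],
    "geneD": [9, 9, 9, 9],
}

filtered = filter_counts(counts, min_count=10, min_samples=2)
print(f"{len(counts)} genes before filtering, {len(filtered)} after")
print(sorted(filtered))
```

Reporting the number of genes before and after, as the last lines do, is a simple starting point for the plot requested above.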

Performing differential expression analysis using the filtered counts

Refer to the DESeq2 vignette on how to perform a basic differential expression analysis. For this dataset, you will simply be testing for differences between the two conditions (control vs. experimental). Choose an appropriate padj threshold to generate a list of statistically significant differentially expressed genes from your analysis.

Perform a basic differential expression analysis and produce the following as well-formatted figures:

  1. A table containing the DESeq2 results for the top ten significant genes ranked by padj

  2. The results from a DAVID or ENRICHR analysis on the significant genes at your chosen padj threshold.
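Selecting the significant genes and the top ten by padj amounts to a filter followed by a sort. The sketch below shows that logic in plain Python on a hypothetical, hand-written results table; it assumes the DESeq2 results have been exported so that each row carries a gene ID and a padj value, and that missing padj values (DESeq2 reports NA for some genes) arrive as None or NaN. The 0.05 threshold is only a placeholder for whichever threshold you choose.

```python
import math


def top_significant(results, padj_threshold=0.05, n=10):
    """Return up to `n` rows with padj below the threshold, smallest padj first.

    Rows whose padj is missing (None or NaN) are skipped rather than treated
    as significant.
    """
    significant = [
        row for row in results
        if row["padj"] is not None
        and not math.isnan(row["padj"])
        and row["padj"] < padj_threshold
    ]
    return sorted(significant, key=lambda row: row["padj"])[:n]


# Hypothetical results rows; the values are made up for illustration.
results = [
    {"gene": "g1", "log2FoldChange": 2.1, "padj": 0.001},
    {"gene": "g2", "log2FoldChange": -0.3, "padj": 0.8},
    {"gene": "g3", "log2FoldChange": 1.4, "padj": float("nan")},
    {"gene": "g4", "log2FoldChange": -1.9, "padj": 0.0004},
    {"gene": "g5", "log2FoldChange": 0.9, "padj": 0.04},
]

top = top_significant(results, padj_threshold=0.05, n=10)
for row in top:
    print(row["gene"], row["log2FoldChange"], row["padj"])
```

The gene IDs from the filtered list (before truncating to ten) are what you would submit to DAVID or ENRICHR.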

RNAseq Quality Control Plots

It is common to produce both a PCA plot and a sample-to-sample distance matrix from our counts. These plots help us judge whether the differences we see in the differential expression analysis can reasonably be attributed to our biological condition of interest.

  1. Choose an appropriate normalization strategy (rlog or vst) and generate a normalized counts matrix for the experiment

  2. Perform PCA on this normalized counts matrix and overlay the sample information in a biplot of PC1 vs. PC2

  3. Create a heatmap or graphic of the sample-to-sample distances for the experiment

  4. In the provided notebook, write at least two paragraphs interpreting these plots and explaining whether you believe the DE analysis was successful.
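To make the idea of a sample-to-sample distance concrete, the sketch below computes pairwise Euclidean distances between samples in plain Python. It assumes the input values have already been normalized (for example with rlog or vst in DESeq2) and exported; the sample names and numbers are hypothetical, and in practice you would compute and plot this matrix with the tools used in your notebook.

```python
import math


def sample_distances(matrix):
    """Return pairwise Euclidean distances between samples.

    `matrix` maps a sample name to its list of normalized expression values,
    with genes in the same order for every sample.
    """
    names = list(matrix)
    return {
        a: {b: math.dist(matrix[a], matrix[b]) for b in names}
        for a in names
    }


# Hypothetical normalized values for three genes in three samples.
normalized = {
    "control_1": [1.0, 2.0, 3.0],
    "control_2": [1.0, 2.0, 4.0],
    "experimental_1": [4.0, 6.0, 3.0],
}

distances = sample_distances(normalized)
for sample, row in distances.items():
    print(sample, [round(d, 2) for d in row.values()])
```

In this toy example the two control samples are much closer to each other than either is to the experimental sample, which is the kind of pattern you would hope to see if condition drives most of the variation.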

FGSEA Analysis

Perform a basic FGSEA analysis and report the results.