Week 3: RNAseq

Week 3 Overview

By now, your pipeline should execute all of the necessary steps to perform sample quality control, alignment, and quantification. This week, we will focus on re-running the pipeline with the full data files and beginning a basic differential expression analysis.

Objectives

  • Re-run your working pipeline on the full data files

  • Evaluate the QC metrics for the original samples

  • Choose a filtering strategy for your raw counts matrix

  • Perform basic differential expression on your data using DESeq2

  • Generate a sample-to-sample distance plot and PCA plot for your experiment

Docker images for your pipeline

FastQC: ghcr.io/bf528/fastqc:latest

MultiQC: ghcr.io/bf528/multiqc:latest

VERSE: ghcr.io/bf528/verse:latest

STAR: ghcr.io/bf528/star:latest

Pandas: ghcr.io/bf528/pandas:latest

Switching to the full data

Once you’ve confirmed that your pipeline works end-to-end on the subsampled files, we will apply the workflow to the original samples. This requires only a few changes.

  1. Create a new directory in samples/ called full_files/. Copy the original files from /projectnb/bf528/materials/project_1_rnaseq/full_files/ to your newly created directory.

  2. Edit your nextflow.config and change the path found in your params.reads to reflect the location of your full files.

Most of our bioinformatics experiments will involve relatively large files and expensive operations. Even relatively small RNAseq experiments involve aligning tens of millions of reads to genomes that are many megabases long. Now that we are working with these larger files, we should not, and in some cases cannot, run these tasks locally on the same node where our VSCode session is running.

We will now switch to running our jobs on the cluster, using the qsub utility to queue jobs on compute nodes. This lets us request nodes that have faster processors or more RAM and easily parallelize tasks that can run simultaneously.

As a reminder, to run your workflow on the cluster, use the cluster profile option in place of the local one, as in project 0. Your new nextflow command should now be:

nextflow run main.nf -profile singularity,cluster

You may also need to unset the resume = true option in your config, or manually set resume = false when you attempt to rerun your workflow on the full data.

You may examine the progress and status of your jobs by using the qstat utility as discussed in lecture and lab.

Evaluate the QC metrics for the full data

After your pipeline has finished, inspect the MultiQC report generated from the full samples.

  1. In your provided notebook, comment on the general quality of the sequencing reads. Write a paragraph in the style of a publication reporting what you find and any metrics that might be concerning.

Analysis Tasks

You will typically perform analyses in either a Jupyter notebook or Rmarkdown. On the SCC, we cannot easily encapsulate R in an isolated environment.

Instead, simply load the R module (on the launch page for VSCode in on-demand) when you start your VSCode session and work in an Rmarkdown notebook. You may install packages as needed; be sure to record the versions used with the sessionInfo() function.

Filtering the counts matrix

We will typically filter our counts matrices to remove genes that we believe will be uninformative for the DE analysis. It is important to remember that filtering is subjective; its purpose is to reduce computational time or to remove uninformative rows.

  1. Choose a filtering strategy and apply it to your counts matrix. In the provided notebook, report the strategy you used and create a plot or a table that demonstrates the effects of your filtering on the counts for all of your samples. Ensure you mention how many genes are present before and after your filtering threshold.
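As an illustration only, one common strategy keeps genes that reach a minimum count in a minimum number of samples. The sketch below uses plain Python on a toy matrix; the gene IDs, the count values, and the thresholds min_count and min_samples are all hypothetical, and the same logic carries over to pandas or R. It is not the required strategy for this assignment.

```python
# Toy raw counts: gene ID -> counts across four samples (made-up values).
counts = {
    "gene_a": [0, 0, 0, 0],
    "gene_b": [12, 15, 0, 9],
    "gene_c": [3, 0, 1, 2],
    "gene_d": [250, 310, 275, 298],
}


def filter_counts(counts, min_count=10, min_samples=2):
    """Keep genes with at least `min_count` reads in at least `min_samples` samples.

    Both thresholds are illustrative assumptions, not recommended defaults.
    """
    return {
        gene: row
        for gene, row in counts.items()
        if sum(c >= min_count for c in row) >= min_samples
    }


filtered = filter_counts(counts)

# Report the number of genes before and after filtering.
print(f"genes before filtering: {len(counts)}")
print(f"genes after filtering:  {len(filtered)}")
```

Whatever rule you choose, reporting the gene count on both sides of the filter, as the last two lines do, makes the effect of the threshold easy to see.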

Performing differential expression analysis using the filtered counts

Refer to the official DESeq2 vignette or the BF591 instructions for how to perform a basic differential expression analysis. For this dataset, you will simply be testing for differences between the conditions (control vs. experimental). Choose an appropriate padj threshold to generate a list of statistically significant differentially expressed genes from your analysis.

Perform a basic differential expression analysis and produce the following as well-formatted figures or tables:

  1. A table containing the DESeq2 results for the top ten significant genes ranked by padj. Your results should include the corresponding gene name for each gene ID. (You extracted these earlier…)

  2. Choose an appropriate padj threshold and report the number of significant genes at this threshold.

  3. The results from a DAVID or Enrichr analysis on the significant genes at your chosen padj threshold. Comment in a notebook on which results you find most interesting from this analysis.
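The first two items above amount to filtering a results table by padj, sorting it, and attaching gene names. A minimal sketch in plain Python follows, assuming a toy DESeq2-style table; the gene IDs, gene names, values, and the 0.05 threshold are hypothetical. DESeq2 can report NA for padj, which is represented here as None and excluded.

```python
# Toy DESeq2-style results (made-up values); None stands in for an NA padj.
results = [
    {"gene_id": "id_1", "log2FoldChange": 2.1, "padj": 0.0004},
    {"gene_id": "id_2", "log2FoldChange": -0.3, "padj": 0.62},
    {"gene_id": "id_3", "log2FoldChange": -1.8, "padj": 0.012},
    {"gene_id": "id_4", "log2FoldChange": 0.9, "padj": None},
]

# Hypothetical gene ID -> gene name mapping, e.g. parsed earlier from the annotation.
id_to_name = {"id_1": "NAME1", "id_2": "NAME2", "id_3": "NAME3", "id_4": "NAME4"}


def significant_genes(results, id_to_name, padj_threshold=0.05):
    """Return rows with padj below the threshold, sorted by padj, with gene names added."""
    kept = [r for r in results if r["padj"] is not None and r["padj"] < padj_threshold]
    kept.sort(key=lambda r: r["padj"])
    # Fall back to the gene ID when no name is available.
    return [{**r, "gene_name": id_to_name.get(r["gene_id"], r["gene_id"])} for r in kept]


sig = significant_genes(results, id_to_name)
top_ten = sig[:10]

print(f"significant genes at padj < 0.05: {len(sig)}")
for row in top_ten:
    print(row["gene_name"], row["log2FoldChange"], row["padj"])
```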

RNAseq Quality Control Plots

It is common to produce both a PCA plot and a sample-to-sample distance matrix from our counts to help us judge whether the differences we see in the differential expression analysis can likely be attributed to our biological condition of interest. DESeq2 provides a convenient wrapper function for the PCA plot, and the vignette demonstrates how to produce both plots.

  1. Choose an appropriate normalization strategy (rlog or vst) and generate a normalized counts matrix for the experiment. Refer to the DESeq2 vignette for specific directions on how to do this.

  2. Perform PCA on this normalized counts matrix and overlay the sample information in a biplot of PC1 vs. PC2.

  3. Create a heatmap or graphic of the sample-to-sample distances for the experiment.

  4. In a notebook, comment in no fewer than two paragraphs on your interpretation of these plots and what they indicate about the samples and the experiment.
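To make the idea behind the sample-to-sample distance plot concrete, the sketch below computes pairwise Euclidean distances between samples on transformed counts. It assumes a toy matrix with made-up sample names and values, and it uses log2(count + 1) purely as a simple stand-in for the transformation; it is not rlog or vst, which you should obtain from DESeq2 as described in the vignette.

```python
import math

# Toy counts: sample name -> counts for the same four genes (made-up values).
samples = {
    "ctrl_1": [100, 20, 5, 300],
    "ctrl_2": [110, 18, 7, 290],
    "exp_1": [10, 200, 6, 310],
    "exp_2": [12, 220, 4, 305],
}


def log_transform(row):
    # log2(count + 1) is an illustrative stand-in, NOT DESeq2's rlog or vst.
    return [math.log2(c + 1) for c in row]


def distance_matrix(samples):
    """Return nested dicts of Euclidean distances between every pair of samples."""
    transformed = {name: log_transform(row) for name, row in samples.items()}
    return {
        a: {b: math.dist(transformed[a], transformed[b]) for b in transformed}
        for a in transformed
    }


dist = distance_matrix(samples)
for name, row in dist.items():
    print(name, [round(d, 2) for d in row.values()])
```

In this toy example the replicates of each condition sit closer to each other than to samples from the other condition, which is the pattern a heatmap of these distances would make visible.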

FGSEA Analysis

Perform a GSEA analysis on your RNAseq results. You are free to use any method available, though we recommend fgsea.

  1. Choose an appropriate ranking metric and use the C2 canonical pathways MSigDB dataset to perform a basic GSEA analysis.

  2. Using a statistical threshold of your choice, generate a figure or plot that displays the most significant results from your FGSEA analysis.

  3. In a notebook, briefly remark on your results and what seems interesting to you about the biology.
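A ranking metric assigns each gene a single score so that the genes can be ordered from most up-regulated to most down-regulated. One commonly used option, shown here only as an example, is the sign of the log2 fold change multiplied by -log10 of the p-value; using the log2 fold change alone is another. The sketch below assumes a toy results table with made-up gene names and values, and the p_floor guard against p-values of zero is an assumption of this example.

```python
import math

# Toy differential expression results (made-up values); None stands in for NA.
results = [
    {"gene": "G1", "log2FoldChange": 2.0, "pvalue": 1e-6},
    {"gene": "G2", "log2FoldChange": -1.5, "pvalue": 1e-4},
    {"gene": "G3", "log2FoldChange": 0.2, "pvalue": 0.5},
    {"gene": "G4", "log2FoldChange": -3.0, "pvalue": None},
]


def rank_genes(results, p_floor=1e-300):
    """Score genes by sign(log2FC) * -log10(pvalue), sorted from highest to lowest."""
    scores = {}
    for r in results:
        if r["pvalue"] is None or r["log2FoldChange"] is None:
            continue  # Genes without a test result cannot be ranked.
        sign = math.copysign(1.0, r["log2FoldChange"])
        # Floor the p-value so that a reported 0 does not break the logarithm.
        scores[r["gene"]] = sign * -math.log10(max(r["pvalue"], p_floor))
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


ranks = rank_genes(results)
for gene, score in ranks.items():
    print(gene, round(score, 3))
```

A named, sorted vector of this shape is what a preranked GSEA tool expects as input; check the documentation of the tool you choose for the exact format.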