Assignment 5

Problem Statement

One of the main questions asked by high throughput sequencing is whether some variable or condition (e.g. treatment/disease status) induces a measurable, distinguishable change. In the context of mRNAseq, we are interested in whether we can determine if mRNA levels (what we actually measure) and genes (what we infer) are significantly up- or downregulated between conditions and if we can relate these changes back to a biological phenotype either directly or indirectly.

Learning Objectives

  1. Understand the methodology and rationale for how DESeq2 normalizes data, models counts, and performs differential expression
  2. Perform differential expression analysis using DESeq2 and learn one method to analyze time course data
  3. Understand how to do basic inspection and evaluation of differential expression (DE) results

Skill List

  1. DESeq2
  2. Basic diagnostic plots of DE results
  3. High-quality figures and plots using R and ggplot

DESeq2 Background

DESeq2 is a well validated and highly cited method for differential expression analysis of HTS data. The original paper describing the methodology in more detail can be found here:

Love, M.I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15, 550 (2014). https://doi.org/10.1186/s13059-014-0550-8

We will link back to the original DESeq2 vignette at the head of each applicable section for reference. We strongly encourage you to read the associated documentation as it succinctly summarizes the important concepts and methodology underlying each step of the process. It also provides example code snippets and answers to many common questions regarding the usage of DESeq2 can already be found in this documentation.

Briefly, at its core, DESeq2 models read counts and tests for differential expression through negative binomial generalized linear models. DESeq2 normalizes count data by scaling by size factors calculated by the median-of-ratios (developed in the original publication for DESeq). For significance testing, DESeq2 performs a Wald test for model coefficients to generate p-values and these are adjusted for multiple testing correction through the Benjamini-Hochberg method.

Generating a counts matrix

DESeq2 takes as input a counts matrix where genes correspond to rows and samples to columns. For one of the more common use cases of differential expression analysis, mRNAseq experiments, we are interested in counts associated with genes; however, DESeq2 is broadly applicable to many kinds of high-throughout sequencing (HTS) count data.

Returning to the analysis of mRNAseq specifically, there are three main approaches to map reads:

  1. Alignment to a reference genome by a splice-aware aligner and counting of reads falling into exonic regions using a reference annotation file
  2. Transcript-based assembly approaches (TopHat, Cufflinks) or transcript abundance quantification (Salmon, Kallisto)
  3. Reference free assembly followed by mapping and counting

DESeq2 expects the values of the input matrix to be counts (integers) and for the purposes of this assignment, you have been provided with a count matrix of samples generated by the first method. Reads were aligned to the mouse genome (GENCODE GRCm39) with STAR and quantified to gene counts by Verse using the matching GTF file.

Prefiltering Counts matrix

DESeq2 Vignette - Pre-filtering

Unless your experiment is quite large, filtering your counts matrix prior to DESeq2 is not strictly required. If your experiment incorporates hundreds of samples and a complicated model design with many interaction terms, you may wish to perform a combination of manual filtering and further optimization of DESeq2 settings. Otherwise, for most datasets, default settings in DESeq2 will run in a reasonably short amount of time on relatively modest hardware even without pre-filtering.

There are many different strategies to filter count datasets and which you choose to employ should be informed by your objective, and your data. Some common filters include removing genes where the mean count across all samples is below some threshold or removing genes based on how many samples have a zero count. In general, extensive pre-filtering is not required and DESeq2 incorporates downstream methods to increase statistical power through independent filtering (described in methods). For this assignment, we will not be filtering the dataset prior to running DE analysis.

Median-of-ratios normalization

Original DESeq Publication

As mentioned earlier, DESeq2 expects a non-transformed, non-normalized matrix of integer “counts” as input. DESeq2 performs its own normalization based on the underlying assumption that not all or a small amount of genes are truly differentially expressed. This is encapsulated in the median-of-ratios method which accounts for library size as well as RNA composition. Although DESeq2 performs this normalization in the background, it can help conceptually to understand the process by which it does so.

First, a pseudo-reference sample is created and set equal to the geometric mean across all samples for a given gene. Then, ratios are calculated comparing each sample to this pseudo-reference on gene-wise basis. Next, on a sample-wise basis, the median value for all ratios is taken as that sample’s normalization factor and for a given sample, all counts are divided by this factor. You have learned how to extract these normalized counts by size factors in the last assignment.

DESeq2 preparation

DESeq2 Vignette - Preparation

The DESeqDataSet is the object that holds the read counts and associated statistical measures generated through the DESeq2 algorithm. Although there are several ways to generate this object, we will focus on the DESeqDataSet method which directly takes a SummarizedExperiment object.

O’Meara et al. Transcriptional Reversion of Cardiac Myocyte Fate During Mammalian Cardiac Regeneration. Circ Res. Feb 2015. PMID: 25477501l

The counts matrix provided was generated by data made available by the listed publication. We have only taken a subset of their samples, focusing specifically on the mRNAseq experiment during in vivo heart development from 0, 4, 7 days after birth as well as in adult mice.

To start, we are going to consider the simplest use case for differential expression and subset our data to contain only the samples from postnatal day 0 and adult hearts. In addition to subsetting the counts matrix, we also need to construct the sample data information as a dataframe that lists the samplenames (columns in our counts matrix) as well as their associated information (in our case, timepoint).

1. Reading and subsetting the data from verse_counts.tsv and sample_metadata.csv

We will use this as a demonstration of one of the simplest use cases for DESeq2, comparing two groups of samples based on a single experimental variable. Subset the samples to only include those from timepoints vP0 and vAd, which should correspond to a total of 4 samples (vP0_1, vP0_2, vAd_1, vAd_2). Store both the counts matrix and sample dataframe in a SummarizedExperiment object.

  1. You may ignore most of the columns in the metadata CSV as they were used in the generation of the data itself. Focus on the samplenames and timepoints columns.

  2. The columns (sample names) in your counts matrix and sample dataframe need to be in the same order

  3. You may need to convert your counts dataframe to a matrix

  4. The counts matrix is verse_counts.tsv and the sample metadata is sample_metadata.csv

2. Running DESeq2

Understanding Factor Levels

DESeq2 Vignette - Factor Levels

The factor level in DESeq2 determines which level represents the “control” group you want to compare against. By default, the reference level for factors will be chosen alphabetically. You can either manually change and set the reference factor level or specify a comparison of interest using DESeq2 Contrast() after performing the first steps of the DE analysis.

Performing Differential Expression Analysis

Here you may find the vignette for example usage of DESeq2. Please read the following section to instruct you on how to perform differential expression analysis in DESeq2: Running DESeq2

We will be running DESeq2 with a model design of ~timepoint with vP0 set as our reference level. This will test for differentially expressed genes in adult mouse heart ventricles vs. postnatal day 0. Return both the results as a dataframe and the dds object from running DESeq2 as a list. Assign the results dataframe and dds object to variables so you can use them later in your Rmd.

3. Annotating results to construct a labeled volcano plot

Later in the analysis, we will construct a volcano plot. Typically, volcano plots will label genes by their significance as well as their direction of change. Convert the results dataframe to a tibble and add a column labeled volc_plot_status that you can easily input in ggplot2 to color points by the below criteria.

  1. The labels will encompass three groups: 1. Positive log fold change and significant at a given padj threshold, 2. Negative log fold change and significant at a given padj threshold, 3. Remaining non-significant genes
  2. Ensure your column is named exactly volc_plot_status or the test will not work without modification.
  3. Have the values for these labels be `UP’, ‘DOWN’, ‘NS’ respectively.

4. Diagnostic plot of the raw p-values for all genes

Separate from differential expression analysis, it is always an important diagnostic to plot the unadjusted distribution of p-values obtained from an experiment. Built in to the definition of a p-value is the idea that under the null distribution and given all other assumptions made are true, p-values follow a uniform distribution. In general, you should typically observe a peak of values close to 0 and a roughly uniform distribution as you approach 1. The values closer to 0 will be a mix of situations where the alternative hypothesis is true as well as potential false positives. Significant deviations from this general pattern likely require close examination of your data and the significance test you are employing. Make a histogram of the raw p-values from all the genes discovered in the experiment.

5. Plotting the LogFoldChanges for differentially expressed genes

It is often helpful visualize the log2FoldChanges for your differentially expressed genes to gain a sense for the distribution of up- and downregulated genes and gain a global view of how genes are changing. Subset your results to only include genes significant at a padj threshold of < .10. Plot a histogram of the log2FoldChange values for these genes.

The choice of FDR cutoff depends on cost

The choice of a FDR cutoff or any p-value cutoff is always subjective and based on the objectives of your experiment and the cost of false discoveries. For example, if your initial sequencing experiment is exploratory and meant to generate hypotheses to be followed up in vitro or in vivo, it may be more appropriate to set a permissive cutoff (padj < .20). P-value thresholds are set on an experiment by experiment basis and should not be adjusted simply to increase or decrease the number of DE genes retroactively.

6. Plotting the normalized counts of differentially expressed genes

It can be helpful to visualize the normalized counts for a few differentially expressed genes as a quick diagnostic that your analysis is working as intended. Although you will likely not need to do this for every RNAseq analysis, it can provide confidence in your results and also help you remember the meaning of the directionality of your fold change values. Make a scatter / jitter plot of the DESeq2 normalized counts for the top ten significantly differentially expressed genes ranked by ascending padj

  1. If you are plotting all of the genes on the same plot, you may find it helpful to take the logarithm of the counts as different genes often have counts that differ by several orders of magnitude.

  2. You may find it helpful to add a slight horizontal “jitter” to the points to aid in visualization for cases where counts for each replicate are similar.

  3. You have already used DESeq2 to extract normalized counts. There are also directions available in the vignette.

7. Volcano Plot to visualize differential expression results

A volcano plot is a common data plot that is intended to quickly display points of interest by plotting statistical significance against a magnitude of change. In the specific case of mRNAseq data, this entails plotting the -log(p-values / adjusted p-values) against the estimated log fold changes reported by most differential expression methods.

Statistically significant genes with low p-values will appear higher on the y-axis of the plot and they will be separated by the magnitude and directionality of their change (i.e. Upregulated genes with larger fold change estimates will be further along to the right of the origin of the x-axis, and downregulated genes will be farther to the left of the origin on the x-axis). Typically, genes of interest that have additional biological relevance to the experiment will be directly annotated using domain knowledge.

Make a volcano plot (log2FoldChange vs. -log10(padj)) for all the genes discovered in the experiment. Use the column generated in part 3 to color the genes with the appropriate label.

8. Running fgsea vignette

We will be performing a GSEA on the results generated by running DESeq2 using the fgsea package. We will use log2FoldChange as our ranking metric and test against all the pathways in the C2 Canonical Pathways gene set collection provided by MSigDB and already downloaded here (c2.cp.v7.5.1.symbols.gmt). Read the section in the textbook or the fgsea documentation for the appropriate formats for the ranked gene list and the gene sets.

  1. You may read in the gene sets in the proper format using GSEABase or fgsea

  2. You will need to convert the mouse ensembl gene IDs to human HGNC symbols to match the genes provided in the C2 Canonical Pathways gene sets. If you are encountering issues, you may need to remove the version number of the annotation.

  3. As you have discovered, biomaRt can be difficult to work with. If you are encountering frequent issues, you may consider mapping your gene IDs and saving the results locally to avoid having to re-run the biomaRt query multiple times.

  4. You will inevitably have genes with duplicate names and/or genes with duplicate values for log2FoldChange. Choose a method to remove these values.

9. Plotting the top ten positive NES and top ten negative NES pathways

  1. Ensure that the pathway labels are visible

  2. You will probably get warnings about duplicate values or ranks, do your best to address these but don’t get hung up on trying to find them all.

References

Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013 Jan 1;29(1):15-21. doi: 10.1093/bioinformatics/bts635. Epub 2012 Oct 25. PMID: 23104886; PMCID: PMC3530905.

Zhu, Q., Fisher, S.A., Shallcross, J., Kim, J. (Preprint). VERSE: a versatile and efficient RNA-Seq read counting tool. bioRxiv 053306. doi:http://dx.doi.org/10.1101/053306

https://www.gencodegenes.org/mouse/

Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biology 11, 1–12 (2010).

Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15, 550 (2014).