Processing math: 68%

Week 6 Slides

R for Biological Science

Adam Labadorf

High Throughput Sequencing

High Throughput Sequencing

  • High throughput sequencing (HTS) technologies measure and digitize the nucleotide sequences of thousands to billions of DNA molecules simultaneously
  • HTS instruments can determine sequence of any DNA molecule in a sample
  • HTS datasets sometimes called unbiased assays
  • Most popular HTS technology today is provided by Illumina biotechnology company

Sequencing by synthesis

  • Illumina sequencing uses biotechnological technique called sequencing by synthesis
  • Sequencing instruments are called sequencers
  • The process, briefly:
    1. 107 to 109 short (~300 nucleotide long) DNA molecules ligated to glass slide
    2. Denatured to become single stranded,
    3. Complementary strand by incorporating fluorescently tagged nucleotides
    4. Tagged nucleotides excited by a laser and photograph taken of slide
    5. Images are processed to reconstruct DNA sequences from fluorescence events

HTS Measures DNA

  • Any DNA molecules may be subjected to sequencing via this method

  • Many different types of experiments are possible:

    • Whole genome sequencing (WGS)
    • RNA Sequencing (RNASeq)
    • Whole genome bisulfite sequencing (WGBS)
    • Chromatin immunoprecipitation followed by sequencing (ChIPSeq)
  • Each of these experiments create the same type of data

  • Each must be interpreted appropriately based on experiment

Raw HTS Data

  • Raw HTS data are the digitized DNA sequences for the billions of molecules captured by the flow cell
  • Each digitized nucleotide sequence is called a read
  • Data stored in a standardized format called the FASTQ file format

FASTQ Format

  • Data for a single read in FASTQ format:
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
  • @ - header, unique read name per flowcell
  • GAT... - the nucleotide sequence
  • + - second header, usually blank
  • !''... - the phred quality scores for each base in the read

HTS Sequence Analysis

  • Raw sequencing reads must be processed with bioinformatic algorithms
  • Two broad classes of sequence analysis:
    • de novo assembly to recover complete length of original molecules
    • sequence alignment against a set of reference sequences, e.g. genome
  • Most data we analyze in R comes from aligned reads that have been produced by other software

Sequence Alignment

  • Reads aligned to a reference sequence form a distribution of read counts across all locations in the reference sequence
  • Any given reference sequence location, or locus, has zero or more reads that align to it
  • The number of reads aligning to a given locus is called the read count
  • The read count is proportional to the copy number of molecules in the sample that correspond to that locus
  • Read counts from all genes in a reference genome form count data

Count Data

  • The read counts that align to each locus of a genome form a distribution

  • Count data have two important properties:

    • Count data are integers
    • Counts are non-negative
  • count data are not normally distributed

Analyzing Count Data

  • Techniques that assume data are normally distributed (like linear regression) are not appropriate for count data

  • Must account for this in one of two ways:

    • Use statistical methods that model counts data - generalized linear models that model data using Poisson and Negative Binomial distributions
    • Transform the data to have be normally distributed - certain statistical transformations, such as regularized log or rank transformations can make counts data more closely follow a normal distribution # RNASeq

RNASeq

  • RNA sequencing (RNASeq) digitizes the RNA molecules in a sample
  • Short reads (~100-300 nucleotides long) represent RNA fragments from longer transcripts
  • RNA molecules are (in principle) sequenced in proportion to their relative copy number in the original sample
  • Each sequencing dataset has a total number of reads called library size
  • Relative copy number means absence of evidence is not evidence of absence
    • i.e. count of zero does not necessarily mean no molecules

RNASeq Gene Expression Data

  • When aligned, reads are counted using a reference annotation that defines locations of genes in the genome
  • For a single sample, output is a vector of read counts for each gene in the annotation
  • Genes with no reads mapping to them will have a count of zero
  • All others will have a read count of one or more;
  • Complex multicellular organisms have on the order of thousands to tens of thousands of genes

The Counts Matrix

  • Read counts from multiple samples of genes using the same annotation are usually concatenated into a counts matrix
  • Counts matrix usually has genes as rows and samples as columns (not tidy!)
  • Usually in csv or tsv (tab separated) format
  • These counts are from O’Meara et al 2014
counts <- read_tsv("verse_counts.tsv")

Example Counts Matrix

counts
## # A tibble: 55,416 × 9
##    gene                  P0_1  P0_2  P4_1  P4_2  P7_1  P7_2  Ad_1  Ad_2
##    <chr>                <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 ENSMUSG00000102693.2     0     0     0     0     0     0     0     0
##  2 ENSMUSG00000064842.3     0     0     0     0     0     0     0     0
##  3 ENSMUSG00000051951.6    19    24    17    17    17    12     7     5
##  4 ENSMUSG00000102851.2     0     0     0     0     1     0     0     0
##  5 ENSMUSG00000103377.2     1     0     0     1     0     1     1     0
##  6 ENSMUSG00000104017.2     0     3     0     0     0     1     0     0
##  7 ENSMUSG00000103025.2     0     0     0     0     0     0     0     0
##  8 ENSMUSG00000089699.2     0     0     0     0     0     0     0     0
##  9 ENSMUSG00000103201.2     0     0     0     0     0     0     0     1
## 10 ENSMUSG00000103147.2     0     0     0     0     0     0     0     0
## # ℹ 55,406 more rows

Analyzing Counts Data

There are typically three steps when analyzing a RNASeq count matrix:

  1. Filter genes that are unlikely to be differentially expressed.
  2. Normalize filtered counts to make samples comparable to one another.
  3. Differential expression analysis of filtered, normalized counts.

Filtering Counts

  • Genes not detected in any sample (counts are zero) should not be considered
  • Genes with very low counts are not likely to be informative
  • Eliminating these genes from consideration reduce multiple hypothesis testing burden later
  • General approach: set read count criteria to filter out genes

Filtering Undetected Genes

  • Genes not detected in any sample should be filtered
nonzero_genes <- rowSums(counts[-1])!=0
filtered_counts <- counts[nonzero_genes,]
slice_head(filtered_counts, n=5)
## # A tibble: 5 × 9
##   gene                  P0_1  P0_2  P4_1  P4_2  P7_1  P7_2  Ad_1  Ad_2
##   <chr>                <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ENSMUSG00000051951.6    19    24    17    17    17    12     7     5
## 2 ENSMUSG00000102851.2     0     0     0     0     1     0     0     0
## 3 ENSMUSG00000103377.2     1     0     0     1     0     1     1     0
## 4 ENSMUSG00000104017.2     0     3     0     0     0     1     0     0
## 5 ENSMUSG00000103201.2     0     0     0     0     0     0     0     1

Filtering Very Low Count Genes

  • Genes with fewer than 2 reads across all samples filtered
nonzero_genes <- rowSums(counts[-1])>=2
filtered_counts <- counts[nonzero_genes,]
slice_head(filtered_counts, n=5)
## # A tibble: 5 × 9
##   gene                   P0_1  P0_2  P4_1  P4_2  P7_1  P7_2  Ad_1  Ad_2
##   <chr>                 <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ENSMUSG00000051951.6     19    24    17    17    17    12     7     5
## 2 ENSMUSG00000103377.2      1     0     0     1     0     1     1     0
## 3 ENSMUSG00000104017.2      0     3     0     0     0     1     0     0
## 4 ENSMUSG00000102331.2      5     8     2     6     6    11     4     3
## 5 ENSMUSG00000025900.14     4     3     6     7     7     3    36    35

Filtering by Non-zero Sample Counts

  • Statistical procedures don’t work with too many zeros
  • Small read counts may indicate very lowly abundant but biologically relevant genes
  • Can filter genes based on number of non-zero sample counts
    • e.g. Filter out genes if they have zero counts in more than half of samples

Filtering by Non-zero Sample Counts

Filtering by Non-zero Sample Counts

  • Consider genes present in at least n samples
  • Plots distribution of mean counts within each number of nonzero samples:

Mean-variance Relationship

  • In counts data, genes with higher mean count also have higher variance
  • Mean-variance dependence means these count data are heteroskedastic

Filtering For Statistics

  • Genes with very few counts do not enable reliable statistical inference
  • There is no biological meaning to any filtering threshold
  • The read counts for a gene depends upon the total number of reads generated
  • We cannot use filtering to “remove lowly expressed genes”
  • Can only filter out genes that are below the detection threshold afforded by the library size of the dataset

Count Distributions

  • The range of gene expression in a cell varies by orders of magnitude
  • Consider the distribution of read counts on a log10 scale from one sample
dplyr::select(filtered_counts, Ad_1) %>%
  filter(Ad_1 > 0) %>% # avoid taking log10(0)
  mutate(`log10(counts)`=log10(Ad_1)) %>%
  ggplot(aes(x=`log10(counts+1)`)) +
  geom_histogram(bins=50) +
  labs(title='log10(counts) distribution for Adult mouse')

Count Distributions

Count Distributions

  • Consider distribution for all the samples as a ridgeline plot

Count Normalization

  • Every sample will have a different number of reads (library size)
  • The number of reads that maps to any given gene is dependent upon:
    • the relative abundance of the molecule in the sample
    • the library size of each sample
  • The raw read counts are not directly comparable between samples
  • Counts in each sample must be normalized so they are comparable

Count Normalization

  • Count normalization: the process by which the number of raw counts in each sample is scaled by a factor to make multiple samples comparable
  • Many strategies have been proposed to do this
  • The simplest is: divide every count by some factor of the library size
    • e.g. compute counts per million reads or CPM normalization

cpms,i=cs,ils106

Counts Per Million

cpms,i=cs,ils106

  • cs,i= raw count of gene i in sample s
  • ls= library size for sample s
  • In principle, the proportion of read counts for each gene will be comparable between samples

CPM Normalization

Library Size Normalization

  • CPM is a library size normalization
  • Library size normalizations are sensitive to extreme outlier genes
    • e.g. if one gene in one sample has abnormally large counts, this can cause other gene counts to be smaller than they truly are
  • These outlier counts are common in RNASeq data!
  • Normalization method should be robust to these outliers

DESeq2 Normalization

  • DESeq2 bioconductor package introduced a robust normalization method
  • Assumes that most genes are not differentially expressed
  • Uses the median geometric mean computed across all samples to determine the scale factor for each sample
  • Currently the de facto standard normalization method for well characterized genomes

DESeq2 Normalization

library(DESeq2)
# DESeq2 requires a counts matrix
# column data (sample information), and a formula
# the counts matrix *must be raw counts*
count_mat <- as.matrix(filtered_counts[-1])

row.names(count_mat) <- filtered_counts$gene

dds <- DESeqDataSetFromMatrix(
  countData=count_mat,
  colData=tibble(sample_name=colnames(filtered_counts[-1])),
  design=~1 # no formula needed, ~1 produces a trivial design matrix
)

DESeq2 Normalization

# compute normalization factors
dds <- estimateSizeFactors(dds)

# extract the normalized counts
deseq_norm_counts <- as_tibble(counts(dds,normalized=TRUE)) %>%
  mutate(gene=filtered_counts$gene) %>%
  relocate(gene) # relocate changes column order

DESeq2 vs CPM

DESeq2 Considerations

The DESeq2 normalization procedure has two important considerations:

  • The procedure borrows information across all samples.
  • The procedure does not use genes with any zero counts.

The CPM normalization procedure does not borrow information across all samples, and therefore is not subject to these considerations.

Count Transformation

  • One way to deal with the non-normality of count data is transform it to follow a normal distribution
  • Enables more statistical methods like linear regression to be used
  • Count data are roughly log-normal, however low and zero counts can be problematic for traditional log()
  • The DESeq2 package provides the rlog() function that performs a regularized logarithmic transformation that avoids these biases
# the dds is the DESeq2 object from earlier
rld <- rlog(dds)

Count Transformation

Differential Expression Analysis

Differential Expression Analysis

  • Goal: identify to what extent gene expression is associated with one or more variables of interest
  • Typically analyze whole transcriptome (i.e. 1000s of genes)
  • Two required components:
    • an expression matrix (e.g. genes x samples)
    • a design matrix

The Design Matrix

  • A design matrix is a numeric matrix that contains
    • the variables we wish to model
    • any covariates or confounders we wish to adjust for
  • Encoded in a way that statistical procedures understand
  • Construct these matrices from a tibble with the model.matrix() function

Example Metadata

ad_metadata 
## # A tibble: 6 × 8
##   ID    age_at_death condition   tau  abeta   iba1   gfap braak_stage
##   <chr>        <dbl> <fct>     <dbl>  <dbl>  <dbl>  <dbl> <fct>      
## 1 A1              73 AD        96548 176324 157501  78139 4          
## 2 A2              82 AD        95251      0 147637  79348 4          
## 3 A3              82 AD        40287  45365  17655 127131 2          
## 4 C1              80 Control   62684  93739 131595 124075 3          
## 5 C2              77 Control   63598  69838   7189  35597 3          
## 6 C3              72 Control   52951  19444  54673  17221 2

Case vs Control Design

  • Model matrix to identify genes that are increased or decreased in people with AD compared with Controls
  • The ~ condition argument is an an R formula
model.matrix(~ condition, data=ad_metadata)

Case vs Control Design

model.matrix(~ condition, data=ad_metadata)
##   (Intercept) conditionAD
## 1           1           1
## 2           1           1
## 3           1           1
## 4           1           0
## 5           1           0
## 6           1           0
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$condition
## [1] "contr.treatment"

Formulas in R

The general format of a formula is as follows:

# portions in [] are optional
[<outcome variable>] ~ <predictor 1> [+ <predictor 2>]...

# examples

# model Gene 3 expression as a function of disease status
`Gene 3` ~ condition

# model the amount of tau pathology as a function of abeta pathology,
# adjusting for age at death
tau ~ age_at_death + abeta

# can create a model without an outcome variable
~ age_at_death + condition

More Complex Designs

  • Can include other variables in model to adjust for covariates or test other contrasts
  • e.g. adjust out the effect of age by including age_at_death as a covariate in the model:
model.matrix(~ age_at_death + condition, data=ad_metadata)

More Complex Designs

model.matrix(~ age_at_death + condition, data=ad_metadata)
##   (Intercept) age_at_death conditionAD
## 1           1           73           1
## 2           1           82           1
## 3           1           82           1
## 4           1           80           0
## 5           1           77           0
## 6           1           72           0
## attr(,"assign")
## [1] 0 1 2
## attr(,"contrasts")
## attr(,"contrasts")$condition
## [1] "contr.treatment"

Differential Expression: Microarrays (limma)

Differential Expression: Microarrays (limma)

  • limma: linear models of microarrays
  • Designed for analyzing microarray gene expression data
  • One of the top most downloaded Bioconductor packages
  • Supports arbitrarily complex experimental designs while maintaining strong statistical power
  • Can perform reliable inference even with small sample sizes

limma Example Setup

  • limma requires:
    • ExpressionSet or SummarizedExperiment container
    • a design matrix
ad_se  <- SummarizedExperiment(
  assays=list(intensities=intensities),
  colData=ad_metadata,
  rowData=rownames(intensities)
)

# design matrix for AD vs control, adjusting for age at death
ad_vs_control_model <- model.matrix(
  ~ age_at_death + condition,
  data=ad_metadata
)

limma Example: fit model

# now run limma
# first fit all genes with the model
fit <- limma::lmFit(
  assays(se)$intensities,
  ad_vs_control_model
)

# now better control fit errors with the empirical Bayes method
fit <- limma::eBayes(fit)

limma Results

  • Look at top results using limma::topTable() function:
# the coefficient name conditionAD is the column name in the design matrix
topTable(fit, coef="conditionAD", adjust="BH", number=5)
##                  logFC  AveExpr         t      P.Value adj.P.Val         B
## 206354_at   -2.7507429 5.669691 -5.349105 3.436185e-05 0.9992944 -2.387433
## 234942_s_at  0.9906708 8.397425  5.206286 4.726372e-05 0.9992944 -2.447456
## 243618_s_at -0.6879431 7.148455 -4.864695 1.020980e-04 0.9992944 -2.598600
## 223703_at    0.8693657 6.154392  4.745033 1.340200e-04 0.9992944 -2.654092
## 244391_at   -0.5251423 5.532712 -4.691482 1.514215e-04 0.9992944 -2.679354

Differential Expression: RNASeq

Differential Expression: RNASeq

  • Normalized read counts are suitable for differential expression
  • Counts matrix created by concatenating read counts for all genes rows are genes or transcripts and columns are samples
  • Read counts not normally distributed, requires statistical approach that either
    1. Models counts explicitly
    2. Transform counts to be normally distributed, use common statistical methods

Modeling Count Data

  • Poisson distribution: expresses probability that a given number of events occur in a fixed period, e.g. time, space, distance, area, volume
  • Can model number of reads aligning to a given genome position/locus
  • Whole Genome Sequencing (WGS) data are modeled this way

Poisson Distribution

f(k;λ)=Pr(X=k)=λkeλk!

  • λ is the average number of events observed in each period

λ=E(X)=Var(X)

  • In Poisson, the mean and variance are assumed to be equal

WGS Read Depth

WGS Read Depth Distribution

RNASeq Read Depth

Mean-variance Relationship

  • In counts data, genes with higher mean count also have higher variance

Negative Binomial Distribution

  • In a set of Bernoulli trials, models the number of failures before a specified number of successes occurs

f(k; r,p) = \mathtt{Pr}(X=k) = \binom{k+r-1}{k} (1-p)^k p^r

  • r - number of “successes”, p - probability of success
  • Alternate formulation:

p = \frac{\mu}{\sigma^2}, r = \frac{\mu^2}{\sigma^2 - \mu}

DESeq2 Dispersion Estimation

  • Negative Binomial Distribution can be expressed in terms of mean and variance
  • Index of dispersion D - defines relationship between mean and variance:

D = \frac{\sigma^2}{\mu}

DESeq2 Dispersion Estimation

DESeq2/EdgeR

  • DESeq2 and edgeR are bioconductor packages that both implement negative binomial regression for RNASeq data
  • Both perform raw counts normalization
    • DESeq2 - described earlier
    • edgeR - trimmed mean of M-values (TMM)

DESeq2

As mentioned briefly above, DESeq2 requires three pieces of information to perform differential expression:

  1. A raw counts matrix with genes as rows and samples as columns
  2. A data frame with metadata associated with the columns
  3. A design formula for the differential expression model

DESeq2 Example

library(DESeq2)

# the counts matrix *must be raw counts*
count_mat <- as.matrix(filtered_counts[-1])
row.names(count_mat) <- filtered_counts$gene

# create a sample matrix from sample names
sample_info <- tibblesample_name=colnames(filtered_counts[-1])) %>%
  separate(sample_name,
    c("timepoint","replicate"),sep="_",remove=FALSE
  ) %>%
  mutate(
    timepoint=factor(timepoint,levels=c("Ad","P0","P4","P7"))
  )

design <- formula(~ timepoint)

DESeq2 Example

sample_info
## # A tibble: 8 × 3
##   sample_name timepoint replicate
##   <chr>       <fct>     <chr>    
## 1 P0_1        P0        1        
## 2 P0_2        P0        2        
## 3 P4_1        P4        1        
## 4 P4_2        P4        2        
## 5 P7_1        P7        1        
## 6 P7_2        P7        2        
## 7 Ad_1        Ad        1        
## 8 Ad_2        Ad        2

DESeq2 Example

dds <- DESeqDataSetFromMatrix(
  countData=count_mat,
  colData=sample_info,
  design=design
)
dds <- DESeq(dds)
resultsNames(dds)
## [1] "Intercept"          "timepoint_P0_vs_Ad" "timepoint_P4_vs_Ad"
## [4] "timepoint_P7_vs_Ad"

DESeq2 Results

res <- results(dds, name="timepoint_P0_vs_Ad")
p0_vs_Ad_de <- as_tibble(res) %>%
  mutate(gene=rownames(res)) %>%
  relocate(gene) %>%
  arrange(pvalue)
head(p0_vs_Ad_de, 5)
## # A tibble: 5 × 7
##   gene                  baseMean log2FoldChange lfcSE  stat    pvalue      padj
##   <chr>                    <dbl>          <dbl> <dbl> <dbl>     <dbl>     <dbl>
## 1 ENSMUSG00000000031.17   19624.           6.75 0.187  36.2 2.79e-286 5.43e-282
## 2 ENSMUSG00000026418.17    9671.           9.84 0.283  34.7 2.91e-264 2.83e-260
## 3 ENSMUSG00000051855.16    3462.           5.35 0.155  34.5 4.27e-261 2.76e-257
## 4 ENSMUSG00000027750.17    9482.           4.05 0.123  33.1 1.23e-239 5.98e-236
## 5 ENSMUSG00000042828.13    3906.          -5.32 0.165 -32.2 3.98e-227 1.55e-223

DESeq2 Results

  • baseMean - the mean normalized count of all samples for the gene
  • log2FoldChange - the estimated coefficient (i.e. log2FoldChange) for the comparison of interest
  • lfcSE - the standard error for the log2FoldChange estimate
  • stat - the Wald statistic associated with the log2FoldChange estimate
  • pvalue - the nominal p-value of the Wald test (i.e. the signifiance of the association)
  • padj - the multiple testing adjusted p-value (i.e. false discovery rate)

Filtering DESeq2 Results

# number of genes significant at FDR < 0.05
filter(p0_vs_Ad_de,padj<0.05)
## # A tibble: 7,467 × 7
##    gene                  baseMean log2FoldChange lfcSE  stat    pvalue      padj
##    <chr>                    <dbl>          <dbl> <dbl> <dbl>     <dbl>     <dbl>
##  1 ENSMUSG00000000031.17   19624.           6.75 0.187  36.2 2.79e-286 5.43e-282
##  2 ENSMUSG00000026418.17    9671.           9.84 0.283  34.7 2.91e-264 2.83e-260
##  3 ENSMUSG00000051855.16    3462.           5.35 0.155  34.5 4.27e-261 2.76e-257
##  4 ENSMUSG00000027750.17    9482.           4.05 0.123  33.1 1.23e-239 5.98e-236
##  5 ENSMUSG00000042828.13    3906.          -5.32 0.165 -32.2 3.98e-227 1.55e-223
##  6 ENSMUSG00000046598.16    1154.          -6.08 0.197 -30.9 5.77e-210 1.87e-206
##  7 ENSMUSG00000019817.19    1952.           5.15 0.170  30.3 1.14e-201 3.18e-198
##  8 ENSMUSG00000021622.4    17146.          -4.57 0.155 -29.5 5.56e-191 1.35e-187
##  9 ENSMUSG00000002500.16    2121.          -6.20 0.216 -28.7 2.36e-181 5.10e-178
## 10 ENSMUSG00000024598.10    1885.           5.90 0.208  28.4 1.46e-177 2.83e-174
## # ℹ 7,457 more rows

Volcano Plot

  • Volcano plots are log2 fold change vs -log10(p-value)
mutate(
  p0_vs_Ad_de,
  `-log10(adjusted p)`=-log10(padj),
  `FDR<0.05`=padj<0.05
  ) %>%
  ggplot(aes(x=log2FoldChange,y=`-log10(adjusted p)`,color=`FDR<0.05`)) +
  geom_point()

Volcano Plot

## Warning: Removed 9551 rows containing missing values or values outside the scale range
## (`geom_point()`).

limma/voom

  • limma package (for microarrays) was extended to support RNASeq
  • Transforms count with voom transformation
    1. Compute CPM
    2. Take log of CPM
    3. Estimate mean-variance relationship to control errors
  • Uses linear model on transformed counts

limma/voom Example

design <- data.frame(swirl = c("swirl.1", "swirl.2", "swirl.3", "swirl.4"),
                     condition = c(1, -1, 1, -1))
dge <- DGEList(counts=counts)
keep <- filterByExpr(dge, design)
dge <- dge[keep,,keep.lib.sizes=FALSE]

limma/voom results

  • Use topTable() as in microarray limma results:
# limma trend
logCPM <- cpm(dge, log=TRUE, prior.count=3)
fit <- lmFit(logCPM, design)
fit <- eBayes(fit, trend=TRUE)
topTable(fit, coef=ncol(design))