Interpreting Gene Expression

Gene Annotations

  • Individual gene studies can characterize:
    • function
    • localization
    • structure
    • interactions
    • chemical properties
    • dynamics
  • Genes are annotated with these properties in different databases
  • Some annotations are consolidated into centralized metadatabases
  • e.g.

Example Gene Card

Gene Ontology (GO)

  • Ontology: a controlled vocabulary of biological concepts
  • The Gene Ontology (GO): a set of terms that describes what genes are, do, etc
  • Each GO Term has a code like GO:NNNNNNN, e.g. GO:0019319
  • GO Terms subdivided into three namespaces:
    • Biological Process (BP) - pathways, etc
    • Molecular Function (MF) - enzymatic activity, DNA binding, etc
    • Cellular Component (CC) - nucleus, cell membrane, etc
  • Terms within each namespace are hierarchical and form a directed acyclic graph (DAG)

Example GO Term

GO Annotation

  • GO Terms can apply to genes from all biological systems
  • The GO itself does not contain gene annotations
  • GO annotations are provided/maintained separately for each organism
  • A gene may be annotated to many GO terms
  • A GO term may be annotated to many genes

Individual to Many Genes

  • High throughput gene expression studies implicate many genes
  • What biological processes are implicated by a differential expression analysis?
  • Idea: examine the annotations of all implicated genes and look for patterns

Gene Sets

  • Genes can be organized by different attributes:
    • biochemical function, e.g. enzymatic activity
    • biological process, e.g. pathways
    • localization, e.g. nucleus
    • disease association
    • chromosomal locus
    • defined by differential expression studies!
    • any other reasonable grouping
  • A gene set is a group of genes related in one way or another
  • e.g. all genes annotated to GO Term GO:0019319 - hexose biosynthetic process

Gene Set Databases

  • Gene set database: a collection of gene sets
  • Different databases organize/maintain different sets of genes for different purposes

GO Annotations

  • Available for many species
  • Programmatic access as well
  • Provided in tab-delimited GAF (GO Annotation Format) files



KEGG: Huntington’s Disease

MSigDB

  • Molecular Signatures Database (MSigDB)
  • Originally, gene sets associated with cancer
  • Contains 9 collections of gene sets:
    • H - well defined biological states/processes
    • C1 - positional gene sets
    • C2 - curated gene sets
    • C3 - regulatory targets
    • C4 - computational gene sets
    • C5 - GO annotations
    • C6 - oncogenic signatures
    • C7 - immunologic signatures
    • C8 - cell type signatures

.gmt - Gene Set File Format

  • .gmt - Gene Matrix Transposed
  • Defined by Broad Institute for use in its GSEA software
  • Contains gene sets, one per line
  • Tab-separated format with columns:
    • 1st column: Gene set name
    • 2nd column: Gene set description (often blank)
    • 3rd and on: gene identifiers for genes in set
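
As a rough illustration of the column layout, the sketch below writes a tiny made-up GMT file and parses it with base R only. The set names, descriptions, and gene identifiers are hypothetical and are not taken from any real database.

```r
# Write a tiny, made-up GMT file: name <TAB> description <TAB> gene1 <TAB> gene2 ...
gmt_lines <- c(
  "SET_A\tfirst toy set\tGENE1\tGENE2\tGENE3",
  "SET_B\t\tGENE2\tGENE4"   # blank description, as is common
)
gmt_path <- tempfile(fileext = ".gmt")
writeLines(gmt_lines, gmt_path)

# Split each line on tabs; fields 3 and onward are the gene identifiers
fields <- strsplit(readLines(gmt_path), "\t", fixed = TRUE)
gene_sets <- lapply(fields, function(f) f[-(1:2)])
names(gene_sets) <- vapply(fields, function(f) f[1], character(1))

gene_sets$SET_B
```

The result is a named list with one character vector of gene identifiers per gene set, which is a convenient shape for the overlap calculations described later.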

GMT File

GMT Files in R

library(GSEABase) # provides getGmt()
hallmarks_gmt <- getGmt(con='h.all.v7.5.1.symbols.gmt')
hallmarks_gmt
## GeneSetCollection
##   unique identifiers: JUNB, CXCL2, ..., SRP14 (4383 total)
##   types in collection:
##     geneIdType: NullIdentifier (1 total)
##     collectionType: NullCollection (1 total)

Gene Set Enrichment Analysis

  • Compare
    • a gene list of interest (e.g. DE gene list) with
    • a gene set (e.g. genes annotated to GO:0019319)
  • Do the genes in our gene list have more similarity to the genes in the gene set than we would expect by chance?

Gene Set Enrichment Flavors

  • Over-representation: does our gene list overlap a gene set more than expected by chance?
  • Rank-based: are the genes in a gene set more increased/decreased in our gene list than expected by chance?


Over-representation Analysis

  • Useful with a list of “genes of interest” e.g. DE genes at FDR < 0.05
  • Compute overlap with genes in a gene set
  • Is the overlap greater than we would expect by chance?
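
A minimal sketch of the overlap calculation, assuming a made-up background of 20 genes. All identifiers are hypothetical, and the choice of background is an assumption that matters in a real analysis.

```r
# Hypothetical identifiers: a background of 20 genes, a gene list of interest,
# and a gene set drawn from the same background
background <- paste0("GENE", 1:20)
gene_list  <- paste0("GENE", 1:6)          # e.g. DE genes
gene_set   <- paste0("GENE", c(4:6, 15))   # e.g. genes annotated to one term

in_list <- background %in% gene_list
in_set  <- background %in% gene_set

# 2x2 table of list membership vs. set membership
overlap_table <- table(in_list = in_list, in_set = in_set)
overlap_table

# One-sided test: is the overlap larger than expected by chance?
overlap_p <- fisher.test(overlap_table, alternative = "greater")$p.value
overlap_p
```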

Hypergeometric/Fisher’s Exact Test

contingency_table <- matrix(c(13, 987, 23, 8977), 2, 2)
fisher_results <- fisher.test(contingency_table, alternative='greater')
fisher_results
##  Fisher's Exact Test for Count Data
## data:  contingency_table
## p-value = 2.382e-05
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
##  2.685749      Inf
## sample estimates:
## odds ratio 
##   5.139308
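
The one-sided Fisher's exact test on a 2x2 table gives the same p-value as an upper-tail hypergeometric probability. The sketch below reuses the counts from the example; the variable names `x`, `m`, `n`, and `k` are chosen here for illustration and simply follow the argument order of `phyper()`.

```r
contingency_table <- matrix(c(13, 987, 23, 8977), 2, 2)
fisher_p <- fisher.test(contingency_table, alternative = "greater")$p.value

# Upper tail of the hypergeometric distribution, P(X >= x), where
#   x = top-left cell, m = first column total,
#   n = second column total, k = first row total
x <- contingency_table[1, 1]
m <- sum(contingency_table[, 1])
n <- sum(contingency_table[, 2])
k <- sum(contingency_table[1, ])
hyper_p <- phyper(x - 1, m, n, k, lower.tail = FALSE)

all.equal(fisher_p, hyper_p)
```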

Rank Based: GSEA

  • Inputs:
    • all genes irrespective of significance, ranked by a statistic e.g. log2 fold change
    • gene set database
  • Examines whether genes in each gene set are more highly or lowly ranked than expected by chance
  • Uses a Kolmogorov-Smirnov-like running-sum statistic to determine significance
  • Produces Normalized Enrichment Score
    • Positive if gene set genes are at the top of the sorted list
    • Negative if at bottom
  • Official software: standalone Java package
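
The running-sum idea behind the enrichment score can be sketched in a few lines of base R. This is a simplified illustration, not the official implementation: it computes only a raw enrichment score, with no permutation-based significance and no normalization, and the function name `enrichment_score` and the toy gene names are hypothetical.

```r
# Simplified running-sum enrichment score (illustrative only)
enrichment_score <- function(stats, gene_set, p = 1) {
  stats <- sort(stats, decreasing = TRUE)        # rank genes by statistic
  in_set <- names(stats) %in% gene_set
  hit <- ifelse(in_set, abs(stats)^p, 0)
  hit <- hit / sum(hit)                          # step up at gene set members
  miss <- ifelse(in_set, 0, 1 / sum(!in_set))    # step down everywhere else
  running <- cumsum(hit - miss)
  unname(running[which.max(abs(running))])       # largest deviation from zero
}

# Toy data with hypothetical gene names
set.seed(1)
stats <- setNames(rnorm(100), paste0("gene", 1:100))
ranked_names <- names(sort(stats, decreasing = TRUE))

es_top    <- enrichment_score(stats, head(ranked_names, 10))
es_bottom <- enrichment_score(stats, tail(ranked_names, 10))
c(top = es_top, bottom = es_bottom)
```

A gene set made of the ten top-ranked genes scores close to +1 and one made of the ten bottom-ranked genes scores close to -1, matching the sign convention described above.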

Rank Based: GSEA


  • fgsea package implements the GSEA preranked algorithm in R
  • Requires
    • List of ranked genes
    • Database of gene sets

Statistical Distributions

  • Statistics was created to quantify uncertainty
  • Statistics provides tools to separate signal from noise
  • The statistical distribution is a tool we can use to estimate and quantify uncertainty

Random Variables

  • A random variable is an object or quantity which:
    • depends upon random events
    • can have samples drawn from it
  • Each random variable has a potential set of possible outcomes (e.g. a real number, an integer, a category, a mathematical tree)
  • Each possible outcome has some probability of appearing, relative to other outcomes
  • The complete mapping of relative probabilities to outcomes for a random variable forms a distribution

Random Variable Examples

  • a six-sided die
  • a coin
  • transcription of a gene
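
The first two examples are easy to simulate. The sketch below draws samples from a fair die and a fair coin with base R's `sample()`; the sample sizes and the seed are arbitrary choices.

```r
set.seed(42)                                   # make the draws reproducible
die_rolls  <- sample(1:6, size = 6000, replace = TRUE)
coin_flips <- sample(c("H", "T"), size = 1000, replace = TRUE)

# Relative frequencies approximate the underlying probabilities
die_freq <- table(die_rolls) / length(die_rolls)
die_freq
table(coin_flips) / length(coin_flips)
```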

Random Variable Notation

  • Usually notated as capital letters, like \(X,Y,Z,\) etc
  • A sample drawn from a random variable is usually notated as a lowercase of the same letter
  • e.g. \(x\) is a sample drawn from the random variable \(X\)
  • Distribution of random variable usually described like
    • “\(X\) follows a binomial distribution”
    • “\(Y\) is a normally distributed random variable”
  • Probability of a random variable taking one of its possible values: \(P(X = x)\)

Statistical Distribution Basics

  • A statistical distribution is a function that maps the possible values for a variable to how often they occur
  • Alternatively, a distribution describes the probability of seeing a single value, or a range of values, relative to all other possible values

library(tidyverse) # provides tibble(), %>%, and ggplot()
tibble(
  x = seq(-4,4,by=0.1),
  `Probability Density`=dnorm(x,0,1)
) %>%
  ggplot(aes(x=x,y=`Probability Density`)) +
  geom_line() +
  labs(title="Probability Density Function for a Normal Distribution")

Normal Distribution

Probability Density Functions

  • A probability density function (PDF) defines the probability density associated with every possible value of a continuous random variable
  • Many PDFs have a closed mathematical form
  • The PDF for the normal distribution is:

\[ P(X = x|\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}e^{\frac{-(x-\mu)^2}{2\sigma^2}} \]

  • Notation \(P(X = x|\mu,\sigma)\) - “the probability that the random variable \(X\) takes the value \(x\), given mean \(\mu\) and standard deviation \(\sigma\)”
  • The normal distribution is a parametric distribution because its PDF is fully specified by a fixed set of parameters, here \(\mu\) and \(\sigma\)
  • Non-parametric distributions do not require parameters, but are determined from data

Probability Mass Functions

  • PDFs are only defined for continuous distributions, which have infinitely many possible values
  • Discrete distributions have a countable (finite or countably infinite) set of possible values
  • Probability mass functions (PMFs) are the discrete analog of PDFs

Note: Probability of Zero

  • In probability theory, if a plausible event has a probability of zero, this does not mean that event can never occur
  • Every specific value in a continuous distribution that supports all real numbers has a probability of zero
  • PDF allows us to reason about the relative likelihood of observing values in one range of the distribution compared with the others

Cumulative Distribution Function

  • PDF provides the probability density of specific values within the distribution
  • Sometimes we want probability of a value being less/greater than or equal to a particular value
  • Cumulative distribution function (CDF) useful for this purpose

tibble(
  x = seq(-4,4,by=0.1),
  PDF = dnorm(x,0,1),
  CDF = pnorm(x,0,1)
) %>%
  ggplot() +
  geom_line(aes(x=x,y=PDF,color="PDF")) +
  geom_line(aes(x=x,y=CDF,color="CDF"))



  • CDF corresponds to area under density curve up to value of \(x\)
  • 1 minus the CDF at \(x\) is the area under the curve above \(x\)
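
A quick numerical check of this relationship for the standard normal distribution, using base R's `integrate()`; the cutoff value 1 is an arbitrary choice.

```r
x <- 1
# Area under the PDF from -Inf up to x, computed numerically
area_below <- integrate(dnorm, lower = -Inf, upper = x)$value
c(area = area_below, cdf = pnorm(x))

# Upper tail: 1 - CDF, or equivalently lower.tail = FALSE
upper_tail <- c(1 - pnorm(x), pnorm(x, lower.tail = FALSE))
upper_tail
```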

Generating random samples

  • CDF is useful for generating samples from the distribution
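
One common way to do this is inverse transform sampling: draw uniform values on (0, 1) and pass them through the inverse CDF. The sketch below uses the exponential distribution as an example because its inverse CDF has a simple closed form; the rate and sample size are arbitrary choices.

```r
set.seed(1)
u <- runif(1000)                  # uniform draws on (0, 1)
rate <- 2

# Inverse CDF of the exponential distribution, written out by hand ...
samples_manual <- -log(1 - u) / rate
# ... and via R's built-in quantile function
samples_qexp <- qexp(u, rate = rate)

all.equal(samples_manual, samples_qexp)
mean(samples_manual)              # should be close to 1 / rate
```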

Distributions in R

  • Four key operations we perform with distributions:
  1. Calculate probabilities using the PDF
  2. Calculate cumulative probabilities using the CDF
  3. Calculate the value associated with a cumulative probability
  4. Sample values from a parameterized distribution
  • Each of these operations has a dedicated function for each different distribution

Distributions in R

  • Each distribution has a family of 4 functions
  • Functions end with shortened name of distribution, e.g. norm for normal distribution
  • Distribution functions prefixed by d, p, q, and r, e.g.
  1. dnorm(x, mean=0, sd=1) - PDF of the normal distribution
  2. pnorm(q, mean=0, sd=1) - CDF of the normal distribution
  3. qnorm(p, mean=0, sd=1) - inverse CDF; accepts cumulative probabilities between 0 and 1 and returns the corresponding quantiles (values of the distribution)
  4. rnorm(n, mean=0, sd=1) - generate n samples from a normal distribution
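
A short illustration of the four functions on the standard normal distribution; the input values are arbitrary.

```r
dnorm(0)            # density at the mean of a standard normal: 1 / sqrt(2 * pi)
pnorm(1.96)         # P(X <= 1.96), roughly 0.975
qnorm(0.975)        # value below which 97.5% of the distribution lies, roughly 1.96
qnorm(pnorm(1.3))   # q* inverts p*, so this returns 1.3 (up to floating point)

set.seed(1)
rnorm(3)            # three random draws
```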

Distributions in R

Distribution        Probability Density/Mass Function
Normal              dnorm(x, mean, sd)
t Distribution      dt(x, df)
Poisson             dpois(x, lambda)
Binomial            dbinom(x, size, prob)
Negative Binomial   dnbinom(x, size, prob, mu)
Exponential         dexp(x, rate)
\(\chi^2\)          dchisq(x, df)

Types of Distributions

  • Broadly 2 types of distributions:
    • Continuous - defined for real numbers within a range
    • Discrete - defined over a countable set of values
  • A distribution may be either:
    • Theoretical (aka parametric) - defined by a parameterized family of mathematical functions
    • Empirical (aka non-parametric) - defined by data

Discrete Distributions

  • Discrete distributions defined over countable, possibly infinite sets
  • Events take one of a set of discrete values
  • Common discrete distributions include
    • binomial (e.g. coin flips)
    • multinomial (e.g. dice rolls)
    • Poisson (e.g. number of injuries by horse kick per day)

Bernoulli random trials

  • Bernoulli trial = a coin flip
  • 2 outcomes with probabilities of \(p\) and \(1-p\)
  • Consider a coin flip:
    • \(Pr(X = 0) = Pr(X = 1) = 0.5\) - coin is fair
    • \(Pr(X = 0) = 0.1, Pr(X = 1) = 0.9\) - coin is not fair
  • Consider a die roll where we count a 6 as a success and any other roll as a failure:
    • \(Pr(X = 0) = 5/6, Pr(X = 1) = 1/6\) - if the die is fair

Bernoulli random trials

library(statip) # NB: not a base R distribution
# dbern(x, prob, log = FALSE)
# qbern(p, prob, lower.tail = TRUE, log.p = FALSE)
# pbern(q, prob, lower.tail = TRUE, log.p = FALSE)
# rbern(n, prob)
rbern(10, 0.5)
##  [1] 0 1 1 1 0 1 0 0 0 0

Binomial distribution

  • Consider a sequence of coin flips, where chance of heads is \(p\)
  • What is the probability of seeing \(x\) heads out of \(n\) flips?
  • The binomial distribution models this situation

\[ Pr(X = x|n,p) = {n \choose x} p^x (1-p) ^{(n-x)} \]
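
The formula can be checked against R's built-in PMF. A minimal sketch, with \(n = 10\) and \(p = 0.5\) chosen arbitrarily:

```r
n <- 10; p <- 0.5; x <- 0:n

# PMF written out from the binomial formula
pmf_manual <- choose(n, x) * p^x * (1 - p)^(n - x)

all.equal(pmf_manual, dbinom(x, size = n, prob = p))  # matches R's built-in PMF
sum(pmf_manual)                                       # probabilities sum to 1
pmf_manual[x == 5]                                    # P(X = 5) = 252 / 1024
```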

Binomial distribution

# dbinom(x, size, prob, log = FALSE)
# pbinom(q, size, prob, lower.tail = TRUE, log.p = FALSE)
# qbinom(p, size, prob, lower.tail = TRUE, log.p = FALSE)
# rbinom(n, size, prob)
rbinom(10, 10, 0.5)
##  [1] 5 5 4 5 5 7 5 6 5 3
mean(rbinom(1000, 10, 0.5))
## [1] 5

Geometric distribution

  • Consider a sequence of coin flips, where chance of heads is \(p\)
  • What is the probability of seeing \(x\) consecutive tails before a head?
  • The geometric distribution models this situation

\[ Pr(X = x|p) = (1-p)^x p \]

Geometric distribution

# dgeom(x, prob, log = FALSE)
# pgeom(q, prob, lower.tail = TRUE, log.p = FALSE)
# qgeom(p, prob, lower.tail = TRUE, log.p = FALSE)
# rgeom(n, prob)
rgeom(10, 0.5)
##  [1] 2 0 1 2 0 2 0 3 0 2


Poisson distribution

  • Consider an event that occurs zero or more times in a particular time period at a known constant average rate \(\lambda\)
  • What is the probability of seeing \(k\) events in a time period?

\[ Pr(X=k|\lambda) = \frac {\lambda^k e^{-\lambda}} {k!} \]


Poisson distribution

# dpois(x, lambda, log = FALSE)
# ppois(q, lambda, lower.tail = TRUE, log.p = FALSE)
# qpois(p, lambda, lower.tail = TRUE, log.p = FALSE)
# rpois(n, lambda)
rpois(10, 5)
##  [1] 3 5 6 5 3 7 3 2 5 8
mean(rpois(1000, 5))
## [1] 4.935

Negative binomial distribution

  • Consider a sequence of coin flips, where chance of heads is \(p\)
  • What is the probability of seeing \(x\) tails by the time we see the \(r\)-th head?

\[ Pr(X = x|r,p) = {x+r-1 \choose r-1} p^r {(1-p)}^x \]
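
As a sanity check, the PMF with its leading term written as the binomial coefficient \({x+r-1 \choose r-1}\) can be compared with R's `dnbinom()`. The values of \(r\) and \(p\) below are arbitrary.

```r
r <- 3; p <- 0.5; x <- 0:5

# PMF with the binomial coefficient written via choose()
pmf_manual <- choose(x + r - 1, r - 1) * p^r * (1 - p)^x

# In dnbinom(), `size` is the target number of successes (heads)
all.equal(pmf_manual, dnbinom(x, size = r, prob = p))
```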

Negative binomial distribution

# dnbinom(x, size, prob, mu, log = FALSE)
# pnbinom(q, size, prob, mu, lower.tail = TRUE, log.p = FALSE)
# qnbinom(p, size, prob, mu, lower.tail = TRUE, log.p = FALSE)
# rnbinom(n, size, prob, mu)
rnbinom(10, 10, 0.5)
##  [1] 18  9  7  6  9  9  3 13 12  4

Continuous Distributions

  • Continuous distributions defined over infinite, possibly bounded domains, e.g. all real numbers

Uniform Distribution

  • Consider a number in the range \([a, b]\)
  • All values in range appear with equal probability

\[ P(X=x|a,b) = \begin{cases} \frac{1}{b-a} & \text{for}\; a \le x \le b, \\ 0 & \text{otherwise} \end{cases} \]

Uniform Distribution

# dunif(x, min = 0, max = 1, log = FALSE)
# punif(q, min = 0, max = 1, lower.tail = TRUE, log.p = FALSE)
# qunif(p, min = 0, max = 1, lower.tail = TRUE, log.p = FALSE)
# runif(n, min = 0, max = 1)
runif(10, min=0, max=1)
##  [1] 0.09269301 0.04769325 0.49162694 0.06384177 0.74975727 0.06286865
##  [7] 0.59144210 0.85150450 0.80188802 0.49818460
mean(runif(1000, min=0, max=1))
## [1] 0.495675

Normal Distribution

  • Normal distribution (Gaussian distribution)

\[ P(X = x|\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}e^{\frac{-(x-\mu)^2}{2\sigma^2}} \]

Normal Distribution

# dnorm(x, mean = 0, sd = 1, log = FALSE)
# pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
# qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
# rnorm(n, mean = 0, sd = 1)
rnorm(10, mean=10, sd=10)
##  [1]  16.5110731   4.9263930  24.7133656  26.2699361   3.0473640  14.1095811
##  [7]   9.6754010  29.0091827 -13.9826178   0.4802505
mean(rnorm(1000, mean=10, sd=10))
## [1] 10.22659

\(\chi^2\) Distribution

  • Models the distribution of the sum of squares of \(k\) independent standard normal random variables

\[ P(X=x|k) = \begin{cases} \frac{ x^{\frac{k}{2}-1}e^{-\frac{x}{2}} }{ 2^{\frac{k}{2}}\Gamma\left ( \frac{k}{2} \right ) }, & x>0;\\ 0, & \text{otherwise} \end{cases} \]
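
The connection to normal random variables can be checked by simulation: the sum of the squares of \(k\) independent standard normal draws should behave like a \(\chi^2\) sample with \(k\) degrees of freedom. A minimal sketch, with \(k = 5\) and the number of draws chosen arbitrarily:

```r
set.seed(1)
k <- 5
n_draws <- 10000

# Each row holds k independent standard normal draws; square and sum across the row
z <- matrix(rnorm(n_draws * k), ncol = k)
sum_sq <- rowSums(z^2)

mean(sum_sq)                    # close to k, the mean of a chi-squared with k df
mean(rchisq(n_draws, df = k))   # direct sampling for comparison
```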

\(\chi^2\) Distribution

# dchisq(x, df, ncp = 0, log = FALSE)
# pchisq(q, df, ncp = 0, lower.tail = TRUE, log.p = FALSE)
# qchisq(p, df, ncp = 0, lower.tail = TRUE, log.p = FALSE)
# rchisq(n, df, ncp = 0)
rchisq(10, 10)
##  [1] 14.078925 10.028708  5.111300 16.071434  5.637748 10.010057 10.421410
##  [8]  6.069114  7.993634 14.850696
mean(rchisq(1000, 10))
## [1] 9.620238

Empirical Distributions

  • Empirical distributions describe the relative frequency of observed values in a dataset
  • Empirical distributions may have any shape, and may be visualized using any of the methods described in the [Visualizing Distributions] section, e.g. a density plot
  • An empirical cumulative distribution function (ECDF) can be computed from the data; it plays the same role as the p* functions of theoretical distributions
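
A small sketch using base R's `ecdf()` on a made-up sample; the values are hypothetical and chosen so the results are easy to verify by hand.

```r
values <- c(1, 2, 2, 3, 5)     # a small made-up sample
emp_cdf <- ecdf(values)        # ecdf() returns a function

emp_cdf(2)                     # fraction of values <= 2: 3 of 5
emp_cdf(c(0, 5))               # 0 below the minimum, 1 at the maximum
quantile(values, 0.5)          # empirical analog of the q* functions
```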

Empirical CDF

ecdf(d$Value)
## Empirical CDF 
## Call: ecdf(d$Value)
##  x[1:3000] = -6.6546, -6.494, -6.2441,  ..., 26.064,  30.29