Data Science

Data Modeling

Data modeling goal:

  • describe a dataset using a relatively small number of mathematical relationships, or
  • use some parts of a dataset to try to accurately predict other parts of the dataset

Models are human inventions

Models:

  • reflect our beliefs about the way the universe works
  • can identify patterns within a dataset that are the result of causal relationships
  • separate meaningful signal of interest from noise
  • do not themselves identify or accurately reflect those causal effects
  • summarize patterns; we as scientists are left to interpret those patterns

Data Modeling Principles

  1. Data are never wrong.
  2. Not all data are useful.
  3. “All models are wrong, but some are useful.” - George Box
  4. Data do not contain causal information.
    • i.e. “Correlation does not mean causation.”
  5. All data have noise.

Types of Models

  • Scientific model:
    • conceptual representation of our belief of the universe
    • proposes causal explanations for phenomena, i.e. hypotheses
    • helps us design experiments and collect data to test the accuracy of the model
  • Statistical model:
    • maps a scientific model onto a mechanical procedure
    • quantifies how well our scientific model explains a dataset

The scientific model and statistical model are related but independent choices we make

Data Exploration

  • We might lack sufficient information to form a scientific model
  • Must first gain knowledge by examining data
  • Can then possibly generate a scientific model
  • This step sometimes called exploratory data analysis (EDA)

A Worked Modeling Example

Consider three imaginary genes measured in the blood of Parkinson’s Disease (PD) patients and controls:

library(tidyverse) # provides tibble, stringr, and ggplot2 used throughout
set.seed(1337)
gene_exp <- tibble(
  sample_name = c(str_c("P",1:100), str_c("C",1:100)),
  `Disease Status` = factor(c(rep("PD",100),rep("Control",100))),
  `Gene 1` = c(rnorm(100,255,10), rnorm(100,250,10)),
  `Gene 2` = c(rnorm(100,520,100), rnorm(100,600,60)),
  `Gene 3` = c(rnorm(100,500,40), rnorm(100,330,40))
)

A Worked Modeling Example

gene_exp
## # A tibble: 200 × 5
##    sample_name `Disease Status` `Gene 1` `Gene 2` `Gene 3`
##    <chr>       <fct>               <dbl>    <dbl>    <dbl>
##  1 P1          PD                   257.     636.     504.
##  2 P2          PD                   241.     556.     484.
##  3 P3          PD                   252.     426.     476.
##  4 P4          PD                   271.     405.     458.
##  5 P5          PD                   248.     482.     520.
##  6 P6          PD                   275.     521.     460.
##  7 P7          PD                   264.     415.     568.
##  8 P8          PD                   276.     450.     495.
##  9 P9          PD                   274.     490.     496.
## 10 P10         PD                   251.     584.     573.
## # ℹ 190 more rows

Data Exploration

pivot_longer(
  gene_exp,
  c(`Gene 1`, `Gene 2`, `Gene 3`),
  names_to="Gene",
  values_to="Expression"
) %>% ggplot(
    aes(x=`Disease Status`,y=Expression,fill=`Disease Status`)
  ) +
  facet_wrap(vars(Gene)) +
  geom_violin()

Data Exploration

Observations

  • Gene 1 does not look different between PD and control
  • Gene 2 looks to have different means, but distributions overlap
  • Gene 3 is very different in PD versus control
  • Qualitative observations: We made them with our eyes
  • How do we quantify these observations?

Questions We Might Ask

  • How much higher is Gene 3 expression in PD than control?
  • If I know/suspect a patient has PD, what Gene 3 expression might we expect them to have?
  • If all I know about a patient is their gene expression, how likely is it that they have PD?

Data Summarization

Data Summarization

  • data summarization - the process of finding a lower-dimensional representation of a larger dataset
  • A histogram is a summarization:
ggplot(gene_exp, aes(x=`Gene 1`)) +
  geom_histogram(bins=30,fill="#a93c13")

Point Estimates

  • point estimate: data summarized to a single number
  • Represent a data distribution’s central tendency.
  • e.g. the arithmetic mean, computed with mean():
ggplot(gene_exp, aes(x=`Gene 1`)) +
  geom_histogram(bins=30,fill="#a93c13") +
  geom_vline(xintercept=mean(gene_exp$`Gene 1`))

Poor Point Estimates

library(patchwork)
well_behaved_data <- tibble(data = rnorm(1000))
# introduce some outliers
data_w_outliers <- tibble(data = c(rnorm(800), rep(5, 200))) 

g_no_outlier <- ggplot(well_behaved_data, aes(x = data)) +
  geom_histogram(fill = "#56CBF9", color = "grey", bins = 30) +
  geom_vline(xintercept = mean(well_behaved_data$data)) +
  ggtitle("Mean example, no outliers")

g_outlier <- ggplot(data_w_outliers, aes(x = data)) +
  geom_histogram(fill = "#7FBEEB", color = "grey", bins = 30) +
  geom_vline(xintercept = mean(data_w_outliers$data)) +
  ggtitle("Mean example, big outliers")

g_no_outlier | g_outlier

Poor Point Estimates

Other Point Estimates

g_no_outlier <- ggplot(well_behaved_data, aes(x = data)) +
  geom_histogram(fill = "#AFBED1", color = "grey", bins = 30) +
  geom_vline(xintercept = median(well_behaved_data$data)) +
  ggtitle("Median example")

g_outlier <- ggplot(data_w_outliers, aes(x = data)) +
  geom_histogram(fill = "#7FBEEB", color = "grey", bins = 30) +
  geom_vline(xintercept = median(data_w_outliers$data)) +
  ggtitle("Median example, big outliers")

g_no_outlier | g_outlier

Other Point Estimates

Dispersion

  • Dispersion describes the “spread” of the data
  • e.g. standard deviation sd()
g1_mean <- mean(gene_exp$`Gene 1`)
g1_sd <- sd(gene_exp$`Gene 1`)
ggplot(gene_exp, aes(x=`Gene 1`)) +
  geom_histogram(bins=30,fill="#a93c13") +
  geom_vline(xintercept=g1_mean) +
  geom_segment(x=g1_mean-g1_sd, y=10, xend=g1_mean+g1_sd, yend=10)

Standard deviation

  • measure how close values are to the mean
data <- tibble(data = c(rnorm(1000, sd=1.75)))
ggplot(data, aes(x = data)) +
  geom_histogram(fill = "#EAC5D8", color = "white", bins = 30) +
  geom_vline(xintercept = seq(-3,3,1) * sd(data$data)) +
  xlim(c(-6, 6)) +
  ggtitle("Standard deviations aplenty", paste("SD:", sd(data$data)))

Standard deviation

Normal Distribution

  • A normal distribution is described by a central tendency (mean) and a dispersion (standard deviation) estimate
  • Can visually inspect fit of a distribution to data:
g1_mean <- mean(gene_exp$`Gene 1`)
g1_sd <- sd(gene_exp$`Gene 1`)
ggplot(gene_exp, aes(x=`Gene 1`)) +
  geom_histogram(
    aes(y=after_stat(density)),
    bins=30,
    fill="#a93c13"
  ) +
  stat_function(fun=dnorm,
    args = list(mean=g1_mean, sd=g1_sd),
    linewidth=2
  )

Normal Distribution

Our First Model

  • We chose to express the dataset as a normal distribution parameterized by:
    • arithmetic mean - 254
    • standard deviation - 11

\[ Gene\;1 \sim \mathcal{N}(254, 11) \]

Other distributions

g_norm <- ggplot(tibble(data = rnorm(5000)), aes(x = data)) +
  geom_histogram(fill = "#D0FCB3", bins = 50, color = "gray") +
  ggtitle("Normal distribution", "rnorm(n = 1000)")

g_unif <- ggplot(tibble(data = runif(5000)), aes(x = data)) +
  geom_histogram(fill = "#271F30", bins = 50, color = "white") +
  ggtitle("Uniform distribution", "runif(n = 1000)")

g_logistic <- ggplot(tibble(data = rlogis(5000)), aes(x = data)) +
  geom_histogram(fill = "#9BC59D", bins = 50, color = "black") +
  ggtitle("Logistic distribution", "rlogis(n = 1000)")

g_exp <- ggplot(tibble(data = rexp(5000, rate = 1)), aes(x = data)) +
  geom_histogram(fill = "#6C5A49", bins = 50, color = "white") +
  ggtitle("Exponential distribution", "rexp(n = 1000, rate = 1)")

(g_norm | g_unif) / (g_logistic | g_exp)

Other distributions

Some notes on our model

We chose to model our Gene 1 expression data using a normal distribution parameterized by the arithmetic mean and standard deviation

  1. Our model choice was totally subjective
  2. We can’t know if this is the “correct” model for the data
  3. We don’t know how well our model describes the data yet

Linear Models

Linear Models

  • Data distributions are descriptive
  • They may not be informative
  • Recall one of our scientific questions:
    • If all I know about a patient is their gene expression, how likely is it that they have PD?
  • Our model doesn’t answer this question yet

The Linear Model

  • linear model: any statistical model that relates one outcome variable as a linear combination (i.e. sum) of one or more explanatory variables
  • For the mathematically inclined:

\[ Y_i = \beta_0 + \beta_1 \phi_1 ( X_{i1} ) + \beta_2 \phi_2 ( X_{i2} ) + \ldots + \beta_p \phi_p ( X_{ip} ) + \epsilon_i \]

Back To Working Example

library(ggbeeswarm)
ggplot(gene_exp, aes(x=`Disease Status`, y=`Gene 3`, color=`Disease Status`)) +
  geom_beeswarm()

Plot Same Data Differently

exp_summ <- pivot_longer( gene_exp, c(`Gene 3`)
  ) %>%
  group_by(`Disease Status`) %>%
  summarize(mean=mean(value),sd=sd(value))

pd_mean <- filter(exp_summ, `Disease Status` == "PD")$mean
c_mean <- filter(exp_summ, `Disease Status` == "Control")$mean

ggplot(gene_exp, aes(x=`Gene 3`, fill=`Disease Status`)) +
  geom_histogram(bins=20, alpha=0.6,position="identity") +
  annotate("segment", x=c_mean, xend=pd_mean, y=20, yend=20,
    arrow=arrow(ends="both", angle=90)) +
  annotate("text", x=mean(c(c_mean,pd_mean)), y=21, hjust=0.5,
    label="How different?")

Plot Same Data Differently

Estimate Difference

  • make a point estimate of this difference by simply subtracting the means:
pd_mean - c_mean
## [1] 164.0942

In other words, this point estimate suggests that on average Parkinson’s patients have 164.1 greater expression than Controls.

Visualizing this Difference

ggplot(gene_exp, aes(x=`Disease Status`, y=`Gene 3`,
                     color=`Disease Status`)) +
  geom_beeswarm() +
  annotate("segment", x=0, xend=3,
           y=2*c_mean-pd_mean, yend=2*pd_mean-c_mean)

Linear Model Terminology

  • Our difference is an estimate, but how good is it?
  • Linear models estimate both the difference and our confidence in that estimate
  • Mathematically:
    • \[Gene\;3 = \beta_0 + \beta_1 \times Disease\;Status\]
  • \(\beta_0\) and \(\beta_1\) are called coefficients
    • \(\beta_0\) is our intercept term
    • \(\beta_1\) is our point estimate of difference between PD and Control
  • Computing these coefficients is called fitting the model
  • This style of analysis is called linear regression

Linear Models in R

  • In R, specify a formula like Y ~ X, e.g. `Gene 3` ~ `Disease Status`
fit <- lm(`Gene 3` ~ `Disease Status`, data=gene_exp)
fit
## 
## Call:
## lm(formula = `Gene 3` ~ `Disease Status`, data = gene_exp)
## 
## Coefficients:
##        (Intercept)  `Disease Status`PD  
##              334.6               164.1

lm() Fit

summary(fit)
## 
## Call:
## lm(formula = `Gene 3` ~ `Disease Status`, data = gene_exp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -124.303  -25.758   -2.434   30.518  119.348 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         334.578      4.414   75.80   <2e-16 ***
## `Disease Status`PD  164.094      6.243   26.29   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.14 on 198 degrees of freedom
## Multiple R-squared:  0.7773, Adjusted R-squared:  0.7761 
## F-statistic:   691 on 1 and 198 DF,  p-value: < 2.2e-16

Computing Fit: naive

# naive solution
fit1 <- lm(`Gene 1` ~ `Disease Status`, data=gene_exp)
fit2 <- lm(`Gene 2` ~ `Disease Status`, data=gene_exp)
fit3 <- lm(`Gene 3` ~ `Disease Status`, data=gene_exp)

Computing Fit

gene_stats <- lapply(
  c("Gene 1", "Gene 2", "Gene 3"),
  function(gene) {
    model <- paste0('`',gene,'`',' ~ `Disease Status`')
    fit <- lm(model, data=gene_exp)
    v <- c(gene,
           coefficients(fit),
           summary(fit)$coefficients[2,4]
           )
    names(v) <- c("Gene", "Intercept", "PD", "pvalue")
    return(v)
  }
  ) %>%
  bind_rows() # turn the list into a tibble

# compute FDR from nominal p-values
gene_stats$padj <- p.adjust(gene_stats$pvalue,method="fdr")

Computed Fit

gene_stats
## # A tibble: 3 × 5
##   Gene   Intercept PD      pvalue    padj
##   <chr>  <chr>     <chr>   <chr>     <dbl>
## 1 Gene 1 250.410   6.959   3.921e-06 3.92e- 6
## 2 Gene 2 597.763   -93.629 2.373e-13 3.56e-13
## 3 Gene 3 334.577   164.094 1.727e-66 5.18e-66

Plotting fit

Generalized Linear Models

  • Linear regression makes strong assumptions about the distribution of the data
  • Generalized linear models allow these assumptions to be relaxed
  • Mathematically, function \(g\) allows different relationships between \(X\) and \(Y\):

\[Y = g^{-1}(\beta_0 + \beta_1 \phi_1 ( X_{i1} ) + \beta_2 \phi_2 ( X_{i2} ) + \ldots + \beta_p \phi_p ( X_{ip} ))\]

Flavors of Linear Models

  • Logistic regression
  • Multinomial regression
  • Poisson regression
  • Negative binomial regression

Statistical Distributions

Statistical Distributions

  • Statistics was created to quantify uncertainty
  • Statistics provides tools to separate signal from noise
  • The statistical distribution is a tool we can use to estimate and quantify uncertainty

Random Variables

  • A random variable is an object or quantity which:
    • depends upon random events
    • can have samples drawn from it
  • Each random variable has a potential set of possible outcomes (e.g. a real number, an integer, a category, a mathematical tree)
  • Each possible outcome has some probability of appearing, relative to other outcomes
  • The complete mapping of relative probabilities to outcomes for a random variable forms a distribution

Random Variable Examples

  • a six-sided die
  • a coin
  • transcription of a gene
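As a small illustration, base R’s sample() can simulate draws from the first two of these random variables:

sample(1:6, size = 5, replace = TRUE)          # five rolls of a fair six-sided die
sample(c("H", "T"), size = 5, replace = TRUE)  # five fair coin flips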

Random Variable Notation

  • Usually notated as capital letters, like \(X,Y,Z,\) etc
  • A sample drawn from a random variable is usually notated as a lowercase of the same letter
  • e.g. \(x\) is a sample drawn from the random variable \(X\)
  • Distribution of random variable usually described like
    • “\(X\) follows a binomial distribution”
    • “\(Y\) is a normally distributed random variable”
  • Probability of a random variable taking one of its possible values: \(P(X = x)\)

Statistical Distribution Basics

  • A statistical distribution is a function that maps the possible values for a variable to how often they occur
  • Alternatively, distribution describes the probability of seeing a single value, or a range of values, relative to all other possible values
tibble(
  x = seq(-4,4,by=0.1),
  `Probability Density`=dnorm(x,0,1)
) %>%
  ggplot(aes(x=x,y=`Probability Density`)) +
  geom_line() +
  labs(title="Probability Density Function for a Normal Distribution")

Normal Distribution

Probability Density Functions

  • probability density function (PDF) defines the probability density associated with every possible value of a random variable
  • Many PDFs have a closed mathematical form
  • The PDF for the normal distribution is:

\[ P(X = x|\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}e^{\frac{-(x-\mu)^2}{2\sigma^2}} \]

PDFs

\[ P(X = x|\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}e^{\frac{-(x-\mu)^2}{2\sigma^2}} \]

  • Notation \(P(X = x|\mu,\sigma)\) - “the probability that the random variable \(X\) takes the value \(x\), given mean \(\mu\) and standard deviation \(\sigma\)”
  • Normal distribution is a parametric distribution because PDF requires two parameters
  • Non-parametric distributions do not require parameters, but are determined from data

Probability Mass Functions

  • PDFs only defined for continuous distributions, with infinite possible values
  • Discrete distributions have a countable set of possible values
  • Probability mass functions (PMFs) are the discrete-distribution analog of PDFs

Note: Probability of Zero

  • In probability theory, if a plausible event has a probability of zero, this does not mean that event can never occur
  • Every specific value in a continuous distribution that supports all real numbers has a probability of zero
  • PDF allows us to reason about the relative likelihood of observing values in one range of the distribution compared with the others
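A brief numeric illustration with the standard normal distribution: an exact value has probability zero, but an interval around it does not.

pnorm(1) - pnorm(1)       # P(X == 1) is exactly zero for a continuous distribution
pnorm(1.1) - pnorm(0.9)   # P(0.9 <= X <= 1.1) is small but positive (~0.048)
dnorm(1)                  # density at 1 is a relative likelihood, not a probability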

Cumulative Distribution Function

  • PDF provides the probability density of specific values within the distribution
  • Sometimes we want probability of a value being less/greater than or equal to a particular value
  • Cumulative distribution function (CDF) useful for this purpose
tibble(
  x = seq(-4,4,by=0.1),
  PDF=dnorm(x,0,1),
  CDF=pnorm(x,0,1)
) %>%
  ggplot() +
  geom_line(aes(x=x,y=PDF,color="PDF")) +
  geom_line(aes(x=x,y=CDF,color="CDF"),linetype="dashed")

PDF vs CDF

CDF

  • CDF corresponds to area under density curve up to value of \(x\)
  • 1 minus that value is the area under the curve greater than that value
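For example, using the standard normal CDF pnorm():

pnorm(1.96)                       # area under the PDF up to 1.96, ~0.975
1 - pnorm(1.96)                   # area above 1.96, ~0.025
pnorm(1.96, lower.tail = FALSE)   # equivalent upper-tail form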

Generating random samples

  • CDF is useful for generating samples from the distribution
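A minimal sketch of this idea (inverse transform sampling): draw uniform values on [0, 1] and pass them through the inverse CDF (quantile function) of the target distribution.

set.seed(42)
u <- runif(1000)                  # uniform draws on [0, 1]
x <- qnorm(u, mean = 0, sd = 1)   # map through the normal inverse CDF

# x is distributed like direct draws from rnorm()
summary(x)
summary(rnorm(1000))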

Distributions in R

  • Four key operations we perform with distributions:
  1. Calculate probabilities using the PDF
  2. Calculate cumulative probabilities using the CDF
  3. Calculate the value associated with a cumulative probability
  4. Sample values from a parameterized distribution
  • Each of these operations has a dedicated function for each different distribution

Distributions in R

  • Each distribution has a family of 4 functions
  • Functions end with shortened name of distribution, e.g. norm for normal distribution
  • Distribution functions prefixed by d, p, q, and r, e.g.
  1. dnorm(x, mean=0, sd=1) - PDF of the normal distribution
  2. pnorm(q, mean=0, sd=1) - CDF of the normal distribution
  3. qnorm(p, mean=0, sd=1) - inverse CDF; accepts probabilities between 0 and 1 and returns the corresponding quantile values of the distribution
  4. rnorm(n, mean=0, sd=1) - generate n samples from a normal distribution
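For example, for the standard normal distribution:

dnorm(0)       # density at 0, ~0.399
pnorm(0)       # P(X <= 0) = 0.5
qnorm(0.975)   # value below which 97.5% of the distribution lies, ~1.96
rnorm(3)       # three random draws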

Distributions in R

Distribution        Probability Density Function
Normal              dnorm(x, mean, sd)
t Distribution      dt(x, df)
Poisson             dpois(x, lambda)
Binomial            dbinom(x, size, prob)
Negative Binomial   dnbinom(x, size, prob, mu)
Exponential         dexp(x, rate)
\(\chi^2\)          dchisq(x, df)

Types of Distributions

Types of Distributions

  • Broadly 2 types of distributions:
    • Continuous - defined for real numbers within a range
    • Discrete - defined for all objects within a set
  • A distribution may be either:
    • Theoretical (aka parametric) - defined by a parameterized family of mathematical functions
    • Empirical (aka non-parametric) - defined by data

Discrete Distributions

  • Discrete distributions defined over countable, possibly infinite sets
  • Events take one of a set of discrete values
  • Common discrete distributions include
    • binomial (e.g. coin flips)
    • multinomial (e.g. dice rolls)
    • Poisson (e.g. number of injuries by horse kick per day)

Bernoulli random trials

  • Bernoulli trial = a coin flip
  • 2 outcomes with probabilities of \(p\) and \(1-p\)
  • Consider a coin flip:
    • \(Pr(X = 0) = Pr(X = 1) = 0.5\) - coin is fair
    • \(Pr(X = 0) = 0.1, Pr(X = 1) = 0.9\) - coin is not fair
  • Consider a die roll where we count a 6 as a success and any other roll as a failure:
    • \(Pr(X = 0) = 5/6, Pr(X = 1) = 1/6\) - if the die is fair

Bernoulli random trials

library(statip) # NB: not a base R distribution
# dbern(x, prob, log = FALSE)
# qbern(p, prob, lower.tail = TRUE, log.p = FALSE)
# pbern(q, prob, lower.tail = TRUE, log.p = FALSE)
# rbern(n, prob)
rbern(10, 0.5)

Bernoulli random trials

rbern(10, 0.5)
##  [1] 1 1 1 0 0 1 1 1 0 1

Binomial distribution

  • Consider a sequence of coin flips, where chance of heads is \(p\)
  • What is the probability of seeing \(x\) heads out of \(n\) flips?
  • The binomial distribution models this situation

\[ Pr(X = x|n,p) = {n \choose x} p^x (1-p) ^{(n-x)} \]
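A quick check of this formula against dbinom(): the probability of exactly 5 heads in 10 fair coin flips.

choose(10, 5) * 0.5^5 * (1 - 0.5)^(10 - 5)   # PMF evaluated by hand, ~0.246
dbinom(5, size = 10, prob = 0.5)             # same value from dbinom()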

Binomial distribution

# dbinom(x, size, prob, log = FALSE)
# pbinom(q, size, prob, lower.tail = TRUE, log.p = FALSE)
# qbinom(p, size, prob, lower.tail = TRUE, log.p = FALSE)
# rbinom(n, size, prob)
rbinom(10, 10, 0.5)

Binomial distribution

rbinom(10, 10, 0.5)
##  [1] 3 8 7 5 6 4 6 6 5 7
mean(rbinom(1000, 10, 0.5))
## [1] 4.999

Geometric distribution

  • Consider a sequence of coin flips, where chance of heads is \(p\)
  • What is the probability of seeing \(x\) consecutive tails before a head?
  • The geometric distribution models this situation

\[ Pr(X = x|p) = (1-p)^x p \]
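A quick check of this formula against dgeom(): the probability of 3 tails before the first head with a fair coin.

(1 - 0.5)^3 * 0.5      # PMF evaluated by hand, 0.0625
dgeom(3, prob = 0.5)   # same value from dgeom()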

Geometric distribution

# dgeom(x, prob, log = FALSE)
# pgeom(q, prob, lower.tail = TRUE, log.p = FALSE)
# qgeom(p, prob, lower.tail = TRUE, log.p = FALSE)
# rgeom(n, prob)
rgeom(10, 0.5)

Geometric distribution

rgeom(10, 0.5)
##  [1] 2 0 3 0 2 0 2 2 0 6

Poisson

  • Consider an event that occurs zero or more times in a particular time period at a known constant average rate \(\lambda\)
  • What is the probability of seeing \(k\) events in a time period?

\[ Pr(X=k|\lambda) = \frac {\lambda^k e^{-\lambda}} {k!} \]
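A quick check of this formula against dpois(): the probability of seeing 3 events when the average rate is 5.

5^3 * exp(-5) / factorial(3)   # PMF evaluated by hand, ~0.140
dpois(3, lambda = 5)           # same value from dpois()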

Poisson

# dpois(x, lambda, log = FALSE)
# ppois(q, lambda, lower.tail = TRUE, log.p = FALSE)
# qpois(p, lambda, lower.tail = TRUE, log.p = FALSE)
# rpois(n, lambda)
rpois(10, 5)

Poisson

rpois(10, 5)
##  [1] 4 5 8 6 4 6 5 3 7 2
mean(rpois(1000, 5))
## [1] 4.917

Negative binomial distribution

  • Consider a sequence of coin flips, where chance of heads is \(p\)
  • What is the probability of seeing \(x\) tails by the time we see \(r\) heads?

\[ Pr(X = x|r,p) = {x+r-1 \choose r-1} p^r {(1-p)}^x \]
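A quick check of this formula against dnbinom(): the probability of seeing 4 tails by the time we see the 3rd head with a fair coin.

choose(4 + 3 - 1, 3 - 1) * 0.5^3 * (1 - 0.5)^4   # PMF evaluated by hand, ~0.117
dnbinom(4, size = 3, prob = 0.5)                 # same value from dnbinom()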

Negative binomial distribution

# dnbinom(x, size, prob, mu, log = FALSE)
# pnbinom(q, size, prob, mu, lower.tail = TRUE, log.p = FALSE)
# qnbinom(p, size, prob, mu, lower.tail = TRUE, log.p = FALSE)
# rnbinom(n, size, prob, mu)
rnbinom(10, 10, 0.5)

Negative binomial distribution

# dnbinom(x, size, prob, mu, log = FALSE)
# pnbinom(q, size, prob, mu, lower.tail = TRUE, log.p = FALSE)
# qnbinom(p, size, prob, mu, lower.tail = TRUE, log.p = FALSE)
# rnbinom(n, size, prob, mu)
rnbinom(10, 10, 0.5)
##  [1] 19  7  9 12  8  8 12 13 12 12

Continuous Distributions

  • Continuous distributions defined over infinite, possibly bounded domains, e.g. all real numbers

Uniform Distribution

  • Consider a number in the range \([a, b]\)
  • All values in range appear with equal probability

\[ P(X=x|a,b) = \begin{cases} \frac{1}{b-a} & \text{for}\; a \le x \le b, \\ 0 & \text{otherwise} \end{cases} \]

Uniform Distribution

# dunif(x, min = 0, max = 1, log = FALSE)
# punif(q, min = 0, max = 1, lower.tail = TRUE, log.p = FALSE)
# qunif(p, min = 0, max = 1, lower.tail = TRUE, log.p = FALSE)
# runif(n, min = 0, max = 1)
runif(10, min=0, max=10)

Uniform Distribution

runif(10)
##  [1] 0.62197160 0.20632568 0.13773154 0.94534494 0.24755702 0.33996150
##  [7] 0.28370202 0.06990996 0.65094488 0.63905524
mean(runif(1000))
## [1] 0.5032486

Normal Distribution

  • Normal distribution (Gaussian distribution)

\[ P(X = x|\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}e^{\frac{-(x-\mu)^2}{2\sigma^2}} \]
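A quick check of this formula against dnorm():

x <- 1; mu <- 0; sigma <- 1
1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))   # PDF by hand, ~0.242
dnorm(x, mean = mu, sd = sigma)                                 # same value from dnorm()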

Normal Distribution

# dnorm(x, mean = 0, sd = 1, log = FALSE)
# pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
# qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
# rnorm(n, mean = 0, sd = 1)
rnorm(10, mean=10, sd=10)

Normal Distribution

rnorm(10, mean=10, sd=10)
##  [1] 22.264245 22.609307 13.508107  5.892955 -5.542218 18.052817 19.157577
##  [8] 20.213437  3.250644 11.789966
mean(rnorm(1000, mean=10, sd=10))
## [1] 10.04227

\(\chi^2\) Distribution

  • Models the distribution of the sum of squares of \(k\) independent standard normal random variables

\[ P(X=x|k) = \begin{cases} \frac{ x^{\frac{k}{2}-1}e^{-\frac{x}{2}} }{ 2^{\frac{k}{2}}\Gamma\left ( \frac{k}{2} \right ) }, & x>0;\\ 0, & \text{otherwise} \end{cases} \]

\(\chi^2\) Distribution

# dchisq(x, df, ncp = 0, log = FALSE)
# pchisq(q, df, ncp = 0, lower.tail = TRUE, log.p = FALSE)
# qchisq(p, df, ncp = 0, lower.tail = TRUE, log.p = FALSE)
# rchisq(n, df, ncp = 0)
rchisq(10, 10)

\(\chi^2\) Distribution

rchisq(10, 10)
##  [1]  7.873837 12.733304  3.567957 11.308481  5.557218 10.716813  5.477301
##  [8]  7.458366  8.126337 16.742170
mean(rchisq(1000, 10))
## [1] 9.816199

Empirical Distributions

  • Empirical distributions describe the relative frequency of observed values in a dataset
  • Empirical distributions may have any shape, and may be visualized using any of the methods described in the Visualizing Distributions section, e.g. a density plot
  • Empirical distributions can be turned into an empirical distribution function similar to the p* theoretical distribution functions

Empirical Distributions

Empirical CDF

# d is assumed to be a tibble with a numeric column `Value` (defined previously)
ecdf(d$Value)
## Empirical CDF 
## Call: ecdf(d$Value)
##  x[1:3000] = -6.1284, -6.0381, -5.8883,  ..., 25.555, 26.728
plot(ecdf(d$Value))

Statistical Tests

Exploratory Data Analysis

Exploratory Data Analysis

  • We might lack sufficient information to form a scientific model
  • Must first gain knowledge by examining data
  • Can then possibly generate a scientific model
  • This step sometimes called exploratory data analysis (EDA)

Principal Component Analysis

  • principal component analysis (PCA): statistical procedure that reduces the effective dimensionality of a dataset while preserving variance
  • One of many dimensionality reduction techniques
  • Identifies orthogonal directions of variance, called principal components

PCA Principles

  • PCA decomposes a dataset into a set of orthonormal basis vectors (principal components)
  • These principal components (PCs) collectively capture all the variance in the dataset
  • First basis vector is called the first principal component and explains the largest fraction of the variance
  • Second principal component explains the second largest fraction, and so on
  • There are always as many principal components as there are dimensions in the dataset or samples, whichever is smaller
  • Typically only small number of PCs needed to explain most of the variance

Principal Components

  • Each principal component is a \(p\)-dimensional unit vector:
    • \(p\) is the number of features in the dataset
    • values are weights that describe the component’s direction of variance
  • Taking the product of each component with the values in each sample, we obtain a projection
  • Projections of each sample made with each principal component produces a rotation of the dataset
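A minimal, self-contained sketch of this idea using a small random matrix (not the microarray data used later): the projections returned by prcomp() equal the centered data multiplied by the rotation matrix of principal components.

set.seed(1)
toy <- matrix(rnorm(50), nrow = 10, ncol = 5)   # 10 samples x 5 features
toy_pca <- prcomp(toy, center = TRUE, scale. = FALSE)

# project the centered data onto the principal components by hand
manual_proj <- scale(toy, center = TRUE, scale = FALSE) %*% toy_pca$rotation

all.equal(unname(manual_proj), unname(toy_pca$x))   # TRUE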

PCA in Biological Analysis

  • Biological datasets often have many thousands of features (e.g. genes) and comparatively few samples
  • Maximum number of PCs in PCA is smaller of (number of features, number of samples)
  • stats::prcomp() function performs PCA in R

PCA Example Setup

# intensities contains microarray expression data for ~54k probesets x 20
# samples

# transpose expression values so samples are rows
expr_mat <- intensities %>%
  pivot_longer(-c(probeset_id),names_to="sample") %>%
  pivot_wider(names_from=probeset_id)

# PCA expects all features (i.e. probesets) to be mean-centered,
# convert to dataframe so we can use rownames
expr_mat_centered <-  as.data.frame(
  lapply(dplyr::select(expr_mat,-c(sample)),function(x) x-mean(x))
)
rownames(expr_mat_centered) <- expr_mat$sample

PCA in R

# prcomp performs PCA
pca <- prcomp(
  expr_mat_centered,
  center=FALSE, # already centered
  scale=TRUE # prcomp scales each feature to have unit variance
)

PCA Results in R

# the str() function prints the structure of its argument
str(pca)
## List of 5
##  $ sdev    : num [1:20] 101.5 81 71.2 61.9 57.9 ...
##  $ rotation: num [1:54675, 1:20] 0.00219 0.00173 -0.00313 0.00465 0.0034 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:54675] "X1007_s_at" "X1053_at" "X117_at" "X121_at" ...
##   .. ..$ : chr [1:20] "PC1" "PC2" "PC3" "PC4" ...
##  $ center  : logi FALSE
##  $ scale   : Named num [1:54675] 0.379 0.46 0.628 0.272 0.196 ...
##   ..- attr(*, "names")= chr [1:54675] "X1007_s_at" "X1053_at" "X117_at" "X121_at" ...
##  $ x       : num [1:20, 1:20] -67.94 186.14 6.07 -70.72 9.58 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:20] "A1" "A2" "A3" "A4" ...
##   .. ..$ : chr [1:20] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "class")= chr "prcomp"

PCA Results in R

The result of prcomp() is a list with five members:

  • sdev - the standard deviation (i.e. the square root of the variance) for each component
  • rotation - a matrix where the principal components are in columns
  • x - the projections of the original data
  • center - if center=TRUE was passed, a vector of feature means
  • scale - if scale=TRUE was passed, a vector of the scaling factors applied to each feature

PC Fraction of Variance

  • Each principal component explains a fraction of the overall variance in the dataset
  • prcomp() result doesn’t provide this fraction
  • The sdev values returned by prcomp() can be squared to obtain the variance explained by each component, then divided by their sum to get the fraction of variance explained

PC Fraction of Variance

library(patchwork)
pca_var <- tibble(
  PC=factor(str_c("PC",1:20),str_c("PC",1:20)),
  Variance=pca$sdev**2,
  `% Explained Variance`=Variance/sum(Variance)*100,
  `Cumulative % Explained Variance`=cumsum(`% Explained Variance`)
)

PC Fraction of Variance Cont’d

exp_g <- pca_var %>%
  ggplot(aes(x=PC,y=`% Explained Variance`,group=1)) +
  geom_point() +
  geom_line() +
  theme(axis.text.x=element_text(angle=90,hjust=0,vjust=0.5))

cum_g <- pca_var %>%
  ggplot(aes(x=PC,y=`Cumulative % Explained Variance`,group=1)) +
  geom_point() +
  geom_line() +
  ylim(0,100) + # set y limits to [0,100]
  theme(axis.text.x=element_text(angle=90,hjust=0,vjust=0.5))

exp_g | cum_g

PC Fraction of Variance

PCA and Outliers

  • PCA can help identify outlier samples
  • Idea:
    • plot the projections of each sample
    • examine the result by eye to identify samples that are “far away” from the other samples
  • No general rules to decide when a sample is an outlier

Pairwise PC Plots

  • PC projections are often plotted against each other, e.g. PC1 vs PC2:
as_tibble(pca$x) %>%
  ggplot(aes(x=PC1,y=PC2)) +
  geom_point()

Pairwise PC Plots

library(GGally)
as_tibble(pca$x) %>%
  dplyr::select(PC1:PC6) %>%
  ggpairs()

Pairwise PC Plots

Beeswarm PC Plots

as_tibble(pca$x) %>%
  pivot_longer(everything(),names_to="PC",values_to="projection") %>%
  mutate(PC=fct_relevel(PC,str_c("PC",1:20))) %>%
  ggplot(aes(x=PC,y=projection)) +
  geom_beeswarm() + labs(title="PCA Projection Plot")

Adding Additional Information

  • Can color our pairwise scatter plot by type like so:
as_tibble(pca$x) %>%
  mutate(type=stringr::str_sub(rownames(pca$x),1,1)) %>%
  ggplot(aes(x=PC1,y=PC2,color=type)) +
  geom_point()

Adding Additional Information

  • Can plot pairs of components as before but now with type information
library(GGally)
as_tibble(pca$x) %>%
  mutate(type=stringr::str_sub(rownames(pca$x),1,1)) %>%
  dplyr::select(c(type,PC1:PC6)) %>%
  ggpairs(columns=1:6,mapping=aes(fill=type))

Adding Additional Information

Beeswarm PC Plots

Cluster Analysis

Cluster Analysis

  • Cluster analysis groups similar objects together
  • Can identify structure or organization in a dataset
  • Many clustering algorithms have been developed for different situations

Clustering Example

Hierarchical Clustering

  • Hierarchical clustering groups data points together in nested, or hierarchical, groups
  • Data points are compared for similarity using a distance function
  • Two broad strategies:
    • Agglomerative: each data point starts in its own group, and groups are iteratively merged into larger groups until all data points are in a single group
    • Divisive: all data points start in the same group, and are recursively split into smaller groups based on their dissimilarity

Clustering: Distance Function

  • Must choose a distance function, e.g \(d(x, y)\)
  • \(d(x, y)\) produces a metric with magnitude proportional to the difference between \(x\) and \(y\)
  • A metric is a non-negative number with no meaningful unit
  • Many different metrics, euclidean distance often a reasonable choice

\[d(p, q) = \sqrt{\sum_{i=1}^{n}(p_i - q_i)^2}\]
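A small check of this formula against base R’s dist(), which uses euclidean distance by default:

p <- c(1, 2, 3)
q <- c(4, 6, 3)
sqrt(sum((p - q)^2))   # 5
dist(rbind(p, q))      # same value, returned as a 'dist' object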

Clustering: Linkage Function

Clustering Example in R

Below we will cluster the synthetic dataset introduced above in R:

ggplot(well_clustered_data, aes(x=f1,y=f2)) +
  geom_point() +
  labs(title="Unclustered")

Clustering Example in R

# compute all pairwise distances using euclidean distance
# as the distance metric
euc_dist <- dist(dplyr::select(well_clustered_data,-ID))

# produce a clustering of the data using the hclust for
# hierarchical clustering
hc <- hclust(euc_dist, method="ave")

# add ID as labels to the clustering object
hc$labels <- well_clustered_data$ID

Clustering Example in R

str(hc)
## List of 7
##  $ merge      : int [1:59, 1:2] -50 -33 -48 -46 -44 -21 -11 -4 -36 -3 ...
##  $ height     : num [1:59] 0.00429 0.08695 0.23536 0.24789 0.25588 ...
##  $ order      : int [1:60] 8 9 7 6 19 18 11 16 4 14 ...
##  $ labels     : chr [1:60] "A1" "A2" "A3" "A4" ...
##  $ method     : chr "average"
##  $ call       : language hclust(d = euc_dist, method = "ave")
##  $ dist.method: chr "euclidean"
##  - attr(*, "class")= chr "hclust"

hclust() output

  • hclust() return object describes the clustering as a tree
  • Can visualize using a dendrogram:
library(ggdendro)
ggdendrogram(hc)

Creating Flat Clusters

  • Can split a hierarchical clustering into flat groups based on the tree structure
  • Can use cutree() to divide the tree into three groups using k=3:
labels <- cutree(hc,k=3)
labels
##  A1  A2  A3  A4  A5  A6  A7  A8  A9 A10 A11 A12 A13 A14 A15 A16 A17 A18 A19 A20 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  B1  B2  B3  B4  B5  B6  B7  B8  B9 B10 B11 B12 B13 B14 B15 B16 B17 B18 B19 B20 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
##  C1  C2  C3  C4  C5  C6  C7  C8  C9 C10 C11 C12 C13 C14 C15 C16 C17 C18 C19 C20 
##   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3

Creating Flat Clusters

# we turn our labels into a tibble so we can join them with the original tibble
well_clustered_data %>%
  left_join(
    tibble(
      ID=names(labels),
      label=as.factor(labels)
    )
  ) %>%
  ggplot(aes(x=f1,y=f2,color=label)) +
  geom_point() +
  labs(title="Clustered")

Creating Flat Clusters

# we turn our labels into a tibble so we can join them with the original tibble
well_clustered_data %>%
  left_join(
    tibble(
      ID=names(labels),
      label=as.factor(labels)
    )
  ) %>%
  ggplot(aes(x=f1,y=f2,color=label)) +
  geom_point() +
  labs(title="Clustered")
## Joining with `by = join_by(ID)`