Data Visualization

  • Data visualization is a core component of both exploring data and communicating results
  • The goal: present data in a graphical way that shows patterns that are otherwise invisible
  • Effective data visualization is challenging!
  • No “gold standard” to follow - only principles and judgment

Properties of good data viz

An effective data visualization:

  1. Depicts accurate data
  2. Depicts data accurately
  3. Shows enough, but not too much, of the data
  4. Is self-contained - no additional information (except a caption) is required to understand the contents of the figure

Properties of great data viz

A great visualization has some additional properties:

  1. Exposes patterns in the data not easily observable by other methods
  2. Invites the viewer to ask more questions about the data

Responsible Plotting

Credits

Responsible Plotting

  • “Good plots empower us to ask good questions.” - Alberto Cairo, How Charts Lie
  • Plots convey ideas (and beliefs)
  • Scientific papers are often structured (and read) around their figures
  • Making plots is easy
  • Conveying ideas is hard

The “Hockey Stick Chart”

5 Qualities of An Effective Viz

  1. It is truthful
  2. It is functional
  3. It is insightful
  4. It is enlightening
  5. It is beautiful

Visualization Principles

  • Visualizations illustrate (from Latin illustrare, to light up or illuminate)
  • Leverage humans’ visual perception system
  • This system is predictive/interpretive
    • i.e. it is a pattern recognition system
  • When this system makes bad predictions, we can experience optical illusions

Contextual Colors

Contextual Colors

Contextual Colors

Lines

Lines

Contextual Area

Seeing Shapes

Seeing Shapes

Key Distortions

  1. We perceive value/hue of colors relative to adjacent colors
  2. Certain geometry interferes with assessing true spatial relationships
  3. We perceive area of shapes relative to adjacent shapes
  4. We may perceive shapes where there are none
  5. We may perceive shapes that look like other shapes

Visual Encodings

  • Visualizations encode data values as visual properties
  • An encoding is a mapping between a data range and a visual property
  • Properties may include:
    • Length/width/height
    • Position
    • Area
    • Angle/proportional area
    • Shape
    • Color hue/value

Visual Decoding

  • Every visualization uses one or more encodings
  • Reading a plot requires decoding from visual back to numbers to form a mental model of the data
  • Familiar encodings (e.g. position in a scatter plot) require less work to interpret than less conventional ones

Encoding Example

data <- tibble(
  x=rnorm(100),
  y=rnorm(100,0,5)
)
data %>%
  ggplot(aes(x=x,y=y)) +
  geom_point()

data <- mutate(data,
  `x times y`=abs(x*y)
)
data %>%
  ggplot(aes(x=x,y=y,size=`x times y`)) +
  geom_point()

data <- mutate(data,
    z=runif(100),
    category=sample(c('A','B','C'), 100, replace=TRUE)
  )
data %>%
  ggplot(aes(x=x, y=y, size=`x times y`, color=z, shape=category)) +
  geom_point()

data <-  mutate(data, `x + y`=x+y) %>%
  arrange(`x + y`) %>%
  mutate(
    xend=lag(x,1),
    yend=lag(y,1)
  )
data %>%
  ggplot() +
  # alpha is a constant, so it is set outside aes() rather than mapped
  geom_segment(aes(x=x,xend=xend,y=y,yend=yend), alpha=0.5) +
  geom_point(aes(x=x, y=y, size=`x times y`, color=z, shape=category))

  1. \(x\) - positional encoding
  2. \(y\) - positional encoding
  3. \(|xy|\) - area encoding (marker size)
  4. \(z\) - color encoding (continuous gradient)
  5. Category - shape encoding (categorical)
  6. Adjacency along \(x+y\) - length encoding

More Complex Encodings

tibble(
  value=rnorm(100),
  category='A'
) %>%
  ggplot(aes(x=category,y=value,fill=category)) +
  geom_violin()

Common Encodings

Plot            \(x\) encoding  \(y\) encoding                           Note
scatter         position        position
vertical bar    position        height
horizontal bar  width           position
lollipop        position        height for line + position for “head”
violin          position        width                                    \(x\) transformed to range, \(y\) transformed to densities

Elementary Perceptual Tasks

  • Decoded visualizations represent relationships between data
  • Different encodings enable more or less precise estimates of those relationships
  • Translating a visualization into quantitative estimates is a perceptual task
  • These tasks were formalized as a theory of elementary perceptual tasks by Cleveland and McGill
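
Cleveland and McGill's experiments found that position along a common scale is decoded more accurately than angle or area. A minimal sketch of that contrast, assuming the ggplot2 and tibble packages and a made-up five-value dataset, draws the same numbers once as bars and once as a pie:

```r
library(ggplot2)
library(tibble)

# hypothetical values, deliberately close together
shares <- tibble(
  group=c('A','B','C','D','E'),
  value=c(23,21,20,19,17)
)

# position/length encoding: the ordering of groups is easy to read off
bar_g <- ggplot(shares, aes(x=group, y=value)) +
  geom_col()

# angle/area encoding: the same ordering is much harder to judge
pie_g <- ggplot(shares, aes(x="", y=value, fill=group)) +
  geom_col(width=1) +
  coord_polar(theta="y")

bar_g
pie_g
```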

Precision of Estimates

# two groups of samples with similar random data profiles
data <- as.matrix(
  tibble(
    A=c(rnorm(10),rnorm(10,2)),
    B=c(rnorm(10),rnorm(10,2)),
    C=c(rnorm(10),rnorm(10,2)),
    D=c(rnorm(10,4),rnorm(10,-1)),
    E=c(rnorm(10,4),rnorm(10,-1)),
    F=c(rnorm(10,4),rnorm(10,-1))
  )
)
rownames(data) <- paste0('G',seq(nrow(data)))
heatmap(data)

Precision of Estimates

Precision of Estimates

Precision of Estimates

Data Viz Opinions

  1. Visualize data in multiple ways
  2. Perform statistical analyses to confirm patterns
  3. Informative is better than beautiful
  4. No plot is better than a useless plot
  5. Sometimes a table is the best way to present data
  6. If there is text on a plot, it should be legible
  7. Almost every plot should have properly labeled axes
  8. Be color-blind friendly
  9. Make visual differences proportional to the differences in the data
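
One way to act on the color-blind-friendly opinion is to swap the default palette for a viridis palette, which ships with ggplot2 and is designed to stay distinguishable under common forms of color vision deficiency. The sketch below also encodes the group as shape, so color is not the only cue. The data are simulated for illustration only.

```r
library(ggplot2)
library(tibble)

set.seed(1)
pts <- tibble(
  x=rnorm(60),
  y=rnorm(60),
  group=rep(c('A','B','C'), each=20)
)

g_viridis <- ggplot(pts, aes(x=x, y=y, color=group, shape=group)) +
  geom_point(size=3) +
  scale_color_viridis_d() + # discrete viridis palette
  labs(x="x (simulated)", y="y (simulated)", color="Group", shape="Group")
g_viridis
```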

Be Faithful To The Data

library(patchwork)
data <- tibble(
  percent=c(86,88,87,90,93,89),
  ID=c('A','B','C','D','E','F')
)
g <- ggplot(data, aes(x=ID,y=percent)) +
  geom_bar(stat="identity")

g + coord_cartesian(ylim=c(85,95)) | g

Be Faithful To The Data

Grammar of Graphics

Grammar of Graphics

  • “Grammar of graphics”: a system of rules that describes how data and graphical aesthetics are combined to form graphics and plots
  • Aesthetics == color, size, shape, etc.
  • First popularized in the book The Grammar of Graphics by Leland Wilkinson and co-authors in 1999

ggplot2

  • ggplot2 is an R package that implements the grammar of graphics
  • Written by Hadley Wickham in 2005

ggplot2 Fundamentals

  • Every plot is the combination of three types of information:
    1. data (i.e. values)
    2. geometry (i.e. shapes)
    3. aesthetics (i.e. mappings that connect values to shapes)

ggplot2 Example

  • A simple example dataset:

    ## # A tibble: 20 × 8
    ##    ID    age_at_death condition    tau  abeta   iba1   gfap braak_stage
    ##    <chr>        <dbl> <fct>      <dbl>  <dbl>  <dbl>  <dbl> <fct>      
    ##  1 A1              73 AD         96548 176324 157501  78139 4          
    ##  2 A2              82 AD         95251      0 147637  79348 4          
    ##                              ...
    ## 10 A10             69 AD         48357  27260  47024  78507 2          
    ## 11 C1              80 Control    62684  93739 131595 124075 3          
    ## 12 C2              77 Control    63598  69838   7189  35597 3          
    ##                              ...
    ## 20 C10             73 Control    15781  16592  10271 100858 1
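
The code on the following slides refers to this tibble as ad_metadata. To run those examples without the original file, one option is to rebuild just the six rows printed above. The elided rows are not reproduced, so plots made from this stand-in will only approximate the originals.

```r
library(tibble)

# Only the rows visible in the printout; the remaining 14 subjects are omitted
ad_metadata <- tibble(
  ID=c('A1','A2','A10','C1','C2','C10'),
  age_at_death=c(73,82,69,80,77,73),
  condition=factor(c('AD','AD','AD','Control','Control','Control')),
  tau=c(96548,95251,48357,62684,63598,15781),
  abeta=c(176324,0,27260,93739,69838,16592),
  iba1=c(157501,147637,47024,131595,7189,10271),
  gfap=c(78139,79348,78507,124075,35597,100858),
  braak_stage=factor(c(4,4,2,3,3,1))
)
ad_metadata
```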

Sidebar: Tau pathology

ggplot2 Example

  • Goal: visualize the relationship between age at death and the amount of tau pathology
  • Try a scatter plot where each marker is a subject with
    • \(x\) is age_at_death
    • \(y\) is tau
    ggplot(
        data=ad_metadata,
        mapping = aes(
          x = age_at_death,
          y = tau
        )
      ) +
      geom_point()

Simple Scatter Plot

ggplot(data=ad_metadata, mapping=aes(x=age_at_death, y=tau)) +
  geom_point()

ggplot2 Plot Components

ggplot(data=ad_metadata, mapping=aes(x=age_at_death, y=tau)) +
  geom_point()
  1. ggplot() - function creates a plot
  2. data= - pass a tibble with the data
  3. mapping=aes(...) - Define an aesthetics mapping connecting data to plot properties
  4. geom_point(...) - Specify geometry as points where marks will be made at pairs of x,y coordinates

Adding More Aesthetics

Is this the whole story?

Adding More Aesthetics

  • There are both AD and Control subjects in this dataset!

  • How does condition relate to this relationship we see?

  • Layer on an additional aesthetic of color:

    ggplot(
        data=ad_metadata,
        mapping = aes(
          x = age_at_death,
          y = tau,
          color=condition # color each point
        )
      ) +
      geom_point()

Adding More Aesthetics

ggplot(data=ad_metadata, mapping=aes(
      x=age_at_death, y=tau, color=condition
    )) + geom_point()

Other Plot Geometries

  • Differences in distributions of variables can be important

  • Examine the distribution of age_at_death for AD and Control samples using the violin geometry, geom_violin():

    ggplot(data=ad_metadata, mapping = aes(x=condition, y=age_at_death)) +
      geom_violin()

Violin Plot

ggplot(data=ad_metadata, mapping = aes(x=condition, y=age_at_death)) +
  geom_violin()

Multiple Plots

  • Can put multiple plots in one figure with patchwork library:
library(patchwork)
age_boxplot <- ggplot(
    data=ad_metadata,
    mapping = aes(x=condition, y=age_at_death)
  ) +
  geom_boxplot()
tau_boxplot <- ggplot(
    data=ad_metadata,
    mapping=aes(x=condition, y=tau)
  ) +
  geom_boxplot()

age_boxplot | tau_boxplot # this puts the plots side by side

Side by Side Plots

age_boxplot | tau_boxplot # this puts the plots side by side

R Graph Gallery

ggplot Mechanics

ggplot Mechanics

  • ggplot has two key concepts that give it great flexibility: layers and scales
  • A layer is one set of data drawn with a geometry and an aesthetic
  • A scale is the mapping from the data values to visual properties
  • A plot may have one or more layers
  • Different layers may share scales or have their own

ggplot Layers

  • Each layer is a set of data connected to a geometry and an aesthetic

  • Each geom_X() function adds a layer to a plot

  • The plot has three layers:

    ggplot(data=ad_metadata, mapping=aes(x=age_at_death)) +
      # a string inside aes() becomes a legend label, not a literal color
      geom_point(mapping=aes(y=tau, color='tau')) +
      geom_point(mapping=aes(y=abeta, color='abeta')) +
      geom_point(mapping=aes(y=iba1, color='iba1'))

ggplot Layers

ggplot Scales

  • A scale maps data onto a range, e.g.:
    • pixel range
    • color on a gradient/set of colors
    • shape type, circle or square
    • shape dimension, like circle radius
  • Multiple layers on the same plot
    • Must share at least one scale to be plotted together
    • May differ in one or more scales to be distinguished from each other
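
A small sketch of layers that share some scales while differing in another, assuming ggplot2, tibble, and simulated data: both layers use the same x and y position scales, and a manual color scale tells them apart. The column names and colors are illustrative choices.

```r
library(ggplot2)
library(tibble)

set.seed(2)
d <- tibble(
  x=seq(1, 20),
  signal_a=x + rnorm(20),
  signal_b=20 - x + rnorm(20)
)

g_layers <- ggplot(d, aes(x=x)) +
  # two layers: same x and y scales, different values on the color scale
  geom_point(aes(y=signal_a, color='signal_a')) +
  geom_point(aes(y=signal_b, color='signal_b')) +
  # the color scale maps the two labels onto explicit colors
  scale_color_manual(values=c(signal_a='blue', signal_b='red')) +
  labs(y="signal", color="Series")
g_layers

# the data each layer draws, after scales are applied, can be inspected separately
head(layer_data(g_layers, 1))
```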

ggplot Scales

How many layers? How many scales?

ggplot Incompatible Scales

ggplot(data=ad_metadata, mapping=aes(x=braak_stage)) +
  # strings inside aes() are legend labels, so name them after the variables
  geom_point(mapping=aes(y=tau, color='tau')) +
  geom_point(mapping=aes(y=age_at_death, color='age_at_death'))

ggplot Incompatible Scales

Plotting One Dimension

Bar Charts

Bar chart

  • Map length (i.e. height or width of rectangle) proportional to scalar value
ggplot(ad_metadata,
  mapping = aes(
    x=ID,
    y=tau)
  ) +
  geom_bar(stat="identity")

Bar chart

ggplot(ad_metadata, mapping = aes(x=ID,y=tau)) +
  geom_bar(stat="identity")

More insightful bar chart

  • Change the fill color of the bars based on condition:
ggplot(ad_metadata, mapping = aes(x=ID,y=tau,fill=condition)) +
  geom_bar(stat="identity")

More insightful bar chart

ggplot(ad_metadata, mapping = aes(x=ID,y=tau,fill=condition)) +
  geom_bar(stat="identity")

Diverging bar chart

  • Bar charts can also plot negative numbers
mutate(ad_metadata, tau_centered=(tau - mean(tau))) %>%
  ggplot(mapping = aes(x=ID, y=tau_centered, fill=condition)) +
  geom_bar(stat="identity")

Diverging bar chart

Lollipop plots

Lollipop plots

  • Similar to bar charts, “lollipop plots” replace bar with a line segment and a circle
  • No dedicated geometry - use geom_point and geom_segment layers:
ggplot(ad_metadata) +
  geom_point(mapping=aes(x=ID, y=tau)) +
  geom_segment(mapping=aes(x=ID, xend=ID, y=0, yend=tau))

Lollipop plots

Stacked Area charts

Stacked Area charts

  • Visualize multiple 1D data that share a common categorical axis
pivot_longer(
    ad_metadata,
    c(tau,abeta,iba1,gfap),
    names_to='Marker',
    values_to='Intensity'
  ) %>%
  ggplot(aes(x=ID,y=Intensity,group=Marker,fill=Marker)) +
    geom_area()

Stacked Area charts

Stacked Area Charts Components

Stacked area plots require three pieces of data:

  • x - a numeric or categorical axis for vertical alignment
  • y - a numeric axis to draw vertical proportions
  • group - a categorical variable that indicates which (x,y) pairs correspond to the same line

Proportional Stacked Area Charts

  • Can view the relative proportion of values in each category rather than the actual values
pivot_longer(
    ad_metadata,
    c(tau,abeta,iba1,gfap),
    names_to='Marker',
    values_to='Intensity'
  ) %>%
  group_by(ID) %>% # divide each intensity by the sum over markers within each ID
  mutate(
    `Relative Intensity`=Intensity/sum(Intensity)
  ) %>%
  ungroup() %>% # remove the grouping so later steps operate on the whole tibble
  ggplot(aes(x=ID,y=`Relative Intensity`,group=Marker,fill=Marker)) +
    geom_area()

Proportional Stacked Area Charts

Visualizing Distributions

Visualizing Distributions

  • A distribution describes the general “shape” of a set of numbers
    • i.e. what is the relative frequency of the values, or ranges of values
  • Must understand distribution of data to choose statistical methods appropriately

Histogram

Histogram

  • First described by Karl Pearson
  • Type of bar chart
  • Divides up the range of a dataset from minimum to maximum into bins usually of the same size
  • Each bin represented by a bar with height or width proportional to number of values that fall within that bin
ggplot(ad_metadata) +
  geom_histogram(mapping=aes(x=age_at_death))
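
The binning step itself can be sketched in base R, independent of ggplot2. The values below are hypothetical ages chosen for illustration, and the bin edges are an arbitrary choice:

```r
# hypothetical ages, for illustration only
ages <- c(61, 64, 68, 69, 71, 73, 73, 77, 80, 82, 85, 89)

# equal-width bins of 10 years spanning the data
breaks <- seq(60, 90, by=10)

# assign each value to a bin; right=FALSE makes bins [60,70), [70,80), [80,90)
bins <- cut(ages, breaks=breaks, right=FALSE)

# the count per bin is the height of the corresponding histogram bar
bin_counts <- table(bins)
bin_counts
```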

Histogram

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Changing histogram bins

ggplot(ad_metadata) +
  geom_histogram(mapping=aes(x=age_at_death),bins=10)

Histograms sensitive to number of data points

  • Histogram of synthetic dataset of 1000 normally distributed values:
tibble(x=rnorm(1000)) %>%
  ggplot() +
  geom_histogram(aes(x=x))

Histograms sensitive to number of data points

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Histograms with multiple distributions

  • Can add multiple histograms to same plot
tibble(
  x=c(rnorm(1000),rnorm(1000,mean=4)),
  type=c(rep('A',1000),rep('B',1000))
) %>%
  ggplot(aes(x=x,fill=type)) +
  geom_histogram(bins=30, alpha=0.6, position="identity")

Histograms with multiple distributions

Density plots

Density plot

  • Similar to histogram, except instead of binning the values draws a smoothly interpolated line that approximates the distribution

  • Density plot is always normalized so the integral under the curve is approximately 1

    ggplot(ad_metadata) +
      geom_density(mapping=aes(x=age_at_death),fill="#c9a13daa")
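
The normalization claim can be checked numerically. This sketch uses base R's density() function, a kernel density estimator, on simulated data and integrates the resulting curve with the trapezoid rule:

```r
set.seed(3)
x <- rnorm(200, mean=75, sd=8) # simulated ages

d <- density(x) # kernel density estimate evaluated on a grid of points

# trapezoid rule: sum of (width * average height) over adjacent grid points
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area
```

The result is close to, but not exactly, 1 because the curve is evaluated on a finite grid.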

Density plot

Density plot vs histogram

library(patchwork)
hist_g <- ggplot(ad_metadata) +
  geom_histogram(mapping=aes(x=age_at_death),bins=30)
density_g <- ggplot(ad_metadata) +
  geom_density(mapping=aes(x=age_at_death),fill="#c9a13daa")

hist_g | density_g

Density plot vs histogram

Density plot vs histogram

library(patchwork)
normal_samples <- tibble(
  x=c(rnorm(1000),rnorm(1000,mean=4)),
  type=c(rep('A',1000),rep('B',1000))
)
hist_g <- ggplot(normal_samples) +
  geom_histogram(
    mapping=aes(x=x,fill=type),
    alpha=0.6, position="identity", bins=30
)
density_g <- ggplot(normal_samples) +
  geom_density(
    mapping=aes(x=x,fill=type),
    alpha=0.6, position="identity"
  )

hist_g | density_g

Density plot vs histogram

Boxplots

Boxplot

  • The boxplot depicts a distribution as a “box and whiskers”
  • Assumes data are unimodal (i.e. roughly bell-shaped)
  • Explicitly draws:
    • Median
    • 25th and 75th percentile (a.k.a. 1st and 3rd quartile)
    • “whiskers” that show further extents
    • Some have markers for outlier samples
ggplot(ad_metadata) +
  geom_boxplot(mapping=aes(x=condition,y=age_at_death))
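
The quantities a boxplot draws can be computed directly. The sketch below uses a small made-up vector and the common convention that whiskers reach the most extreme value within 1.5 times the interquartile range of the box edges. Implementations may define the box edges slightly differently.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50) # made-up data with one extreme value

q <- quantile(x, c(0.25, 0.5, 0.75)) # box edges and median
iqr <- q[3] - q[1]

# whiskers extend to the most extreme values inside the 1.5 * IQR fences
lower_whisker <- min(x[x >= q[1] - 1.5 * iqr])
upper_whisker <- max(x[x <= q[3] + 1.5 * iqr])

# anything beyond the whiskers is drawn as an individual outlier marker
outliers <- x[x < lower_whisker | x > upper_whisker]

q
c(lower_whisker, upper_whisker)
outliers
```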

Boxplot

ggplot(ad_metadata) +
  geom_boxplot(mapping=aes(x=condition,y=age_at_death))

Boxplot Anatomy

Boxplot shortcomings

normal_samples <- tibble(
  x=c(rnorm(1000),rnorm(1000,4),rnorm(1000,2,3)),
  type=c(rep('A',2000),rep('B',1000))
)
ggplot(normal_samples, aes(x=x,fill=type)) +
  geom_density(alpha=0.6) # constant transparency belongs outside aes()

Boxplot shortcomings

ggplot(normal_samples, aes(x=type,y=x,fill=type)) +
  geom_boxplot()

Boxplot shortcomings

library(patchwork)
g <- ggplot(normal_samples, aes(x=type,y=x,fill=type))
boxplot_g <- g + geom_boxplot()
violin_g <- g + geom_violin()

boxplot_g | violin_g

Boxplot shortcomings

  • Boxplots can be misleading

Violin Plots

Violin plot

  • A violin plot produces a shape whose width is proportional to the density of values along the x or y axis
  • Similar in principle to a histogram or a density plot
ggplot(ad_metadata) +
  geom_violin(aes(x=condition,y=tau,fill=condition))

Violin plot

Beeswarm Plots

Beeswarm plot

  • A beeswarm plot is similar to a violin plot
  • Plots the data itself as points like in a scatter plot
  • Points are ‘jittered’ so they don’t overlap
library(ggbeeswarm)
ggplot(ad_metadata) +
  geom_beeswarm(
    aes(x=condition,y=age_at_death,color=condition),
    cex=2,
    size=2
  )

Beeswarm plot

Beeswarm plot limitations

  • Typically only useful when the number of values is neither too large nor too small:
normal_samples <- tibble(
  x=c(rnorm(1000),rnorm(1000,4),rnorm(1000,2,3)),
  type=c(rep('A',2000),rep('B',1000))
)
ggplot(normal_samples, aes(x=type,y=x,color=type)) +
  geom_beeswarm()

Beeswarm plot limitations

ggplot(normal_samples, aes(x=type,y=x,color=type)) +
  geom_beeswarm()

Seeing patterns in bees

  • Can color bees by another value
normal_samples <- tibble(
  x=c(rnorm(100),rnorm(100,4),rnorm(100,2,3)),
  type=c(rep('A',200),rep('B',100)),
  category=sample(c('healthy','disease'),300,replace=TRUE)
)
ggplot(normal_samples, aes(x=type,y=x,color=category)) +
  geom_beeswarm()

Seeing patterns in bees

Ridgeline Plots

Ridgeline

  • Ridgeline charts compare many non-trivial distributions in one plot
  • They are simply multiple density plots drawn for different variables within the same plot
library(ggridges)

tibble(
  x=c(rnorm(100),rnorm(100,4),rnorm(100,2,3)),
  type=c(rep('A',200),rep('B',100)),
) %>%
  ggplot(aes(y=type,x=x,fill=type)) +
  geom_density_ridges()

Ridgeline

## Picking joint bandwidth of 0.822

Many ridgelines

tibble(
  x=rnorm(10000,mean=runif(10,1,10),sd=runif(2,1,4)),
  type=rep(c("A","B","C","D","E","F","G","H","I","J"),1000)
) %>%
  ggplot(aes(y=type,x=x,fill=type)) +
  geom_density_ridges(alpha=0.6,position="identity")

Many ridgelines

## Picking joint bandwidth of 0.494

Plotting Two or More Dimensions

Scatter Plots

  • Scatter plots visualize pairs of quantities, usually continuous, as points in a two dimensional space
  • Usually in cartesian coordinates, but polar coordinates or other types of coordinate systems are possible
ggplot(ad_metadata,mapping=aes(x=abeta, y=tau)) +
  geom_point(size=3)

Scatter Plots

ggplot(ad_metadata,mapping=aes(x=abeta, y=tau)) +
  geom_point(size=3)

Scatter Plots: Marker Shapes

ggplot(ad_metadata,mapping=aes(x=abeta, y=tau, shape=condition)) +
  geom_point(size=3)

Scatter Plots: Marker Shapes

Scatter Plots: Marker Shapes

g <- ggplot()
for(x in 0:5) {
  for(y in 0:4) {
    if(x+y*6 < 26) {
      g <- g + geom_point(
        data=tibble(x=x,y=y), # geom functions take mapping first, so name both arguments
        mapping=aes(x=x,y=y),
        shape=x+y*6,size=8) +
       geom_label(
         data=tibble(x=x,y=y,label=x+y*6),
         mapping=aes(x=x,y=y+0.5,label=label)
       )
    }
  }
}
g

Scatter Plots: Marker Shapes

ggplot(ad_metadata,mapping=aes(x=abeta, y=tau, shape=condition)) +
  geom_point(size=3) +
  scale_shape_manual(values=c(3,9))

Scatter Plots: Marker Shapes

Scatter Plots: Color Encodings

  • Color encodings added using either continuous or discrete values:
library(patchwork)
g <- ggplot(ad_metadata)
g_condition <- g + geom_point(mapping=aes(x=abeta, y=tau, color=condition),size=3)
g_age <- g + geom_point(mapping=aes(x=abeta, y=tau, color=age_at_death),size=3)
g_condition / g_age

Scatter Plots: Color Encodings

Bubble Plots

  • Close relative of the scatter plot where the area of the point markers is proportional to a third dimension
ggplot(ad_metadata,mapping=aes(x=abeta, y=tau, size=age_at_death)) +
  geom_point(alpha=0.5)

Bubble Plots

Connected Scatter Plots

  • Close relative of the scatter plot where certain pairs of points are connected with a line
arrange(ad_metadata,age_at_death) %>%
  mutate(
    x=abeta,
    xend=lag(x,1),
    y=tau,
    yend=lag(y,1)
  ) %>%
  ggplot() +
  geom_segment(aes(x=abeta, xend=xend, y=tau, yend=yend)) +
  geom_point(aes(x=x,y=y,shape=condition,color=condition),size=3)

Connected Scatter Plots

Line Plots

  • Line plots connect pairs of points with a line without drawing a symbol at each point
  • geom_line() function draws lines between pairs of points sorted by \(x\) axis by default
ggplot(ad_metadata,mapping=aes(x=abeta, y=tau)) +
    geom_line()
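
Because geom_line() orders points by \(x\) before connecting them, it can hide the order in which observations were recorded. ggplot2's geom_path() connects points in the order they appear in the data instead. A sketch with made-up coordinates:

```r
library(ggplot2)
library(tibble)

# made-up points, deliberately not sorted by x
d <- tibble(
  x=c(3, 1, 4, 2, 5),
  y=c(2, 5, 1, 4, 3)
)

line_g <- ggplot(d, aes(x=x, y=y)) + geom_line() # connects in order of x
path_g <- ggplot(d, aes(x=x, y=y)) + geom_path() # connects in row order

# the x values each layer draws, in drawing order
layer_data(line_g)$x
layer_data(path_g)$x
```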

Line Plots

Line Plots: Multiple Lines

  • Plot multiple lines using the group aesthetic mapping
pivot_longer(ad_metadata,
  c(tau,abeta,iba1,gfap),
  names_to='Marker',
  values_to='Intensity'
  ) %>%
  # the piped tibble is already the data argument, so only the mapping is passed
  ggplot(mapping=aes(x=ID, y=Intensity, group=Marker, color=Marker)) +
    geom_line()

Line Plots: Multiple Lines

Parallel Coordinate Plots

  • Line plots where:
    • 2+ continuous variables given vertical position encodings and
    • Each sample has its own line
  • Each vertical axis can have a different scale
library(GGally)

ggparcoord(ad_metadata,
           columns=c(2,4:8),
           groupColumn=3,
           showPoints=TRUE
           ) +
  scale_color_manual(values=c("#bbbbbb", "#666666"))

Parallel Coordinate Plots

Note: hexadecimal color codes

  • Colors can be specified with a 6- or 8-hexit hexadecimal code
  • Hexit - one hexadecimal digit, 0 - f, where 0=0 and f=15
  • Codes form a triplet #rrggbb (an 8-hexit code, #rrggbbaa, adds an opacity value aa)
    • rr is the value of the color red
    • gg for green
    • bb for blue
  • #ffffff is the color white
  • #000000 is the color black
  • #ff0000 is the color red
  • #0000ff is the color blue
  • #7fff00 is the color chartreuse
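
Base R can convert between hexadecimal codes and decimal channel values, which makes the #rrggbb layout concrete:

```r
# a two-hexit channel ranges from 00 (0) to ff (255)
strtoi("ff", base=16L)
strtoi("7f", base=16L)

# split a color code into its red, green, and blue channels
channels <- col2rgb("#7fff00")
channels

# and build a code back up from decimal channel values
chartreuse <- rgb(127, 255, 0, maxColorValue=255)
chartreuse
```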

Heatmaps

Heatmaps

  • Heatmaps visualize values associated with a grid of points \((x,y,z)\) as a grid of colored rectangles
  • \(x\) and \(y\) define the grid point coordinates and \(z\) is a continuous value
  • A common heatmap you might have seen is the weather map, which plots current or predicted weather patterns on top of a geographic map:

Weathermaps

Heatmaps in R

  • Heatmaps are often used to visualize matrices
  • Can create heatmap in R using the base R heatmap() function:
  • heatmap() function creates a clustered heatmap where the rows and columns have been hierarchically clustered
# heatmap() requires an R matrix, and cannot accept a tibble or a dataframe
marker_matrix <- as.matrix(
  dplyr::select(ad_metadata,c(tau,abeta,iba1,gfap))
)
# rownames of the matrix become y labels
rownames(marker_matrix) <- ad_metadata$ID

heatmap(marker_matrix)

Example Heatmap

# heatmap() requires an R matrix, and cannot accept a tibble or a dataframe
marker_matrix <- as.matrix(
  dplyr::select(ad_metadata,c(tau,abeta,iba1,gfap))
)
# rownames of the matrix become y labels
rownames(marker_matrix) <- ad_metadata$ID

heatmap(marker_matrix)

heatmap() Functionality

  • Performs hierarchical clustering of the rows and columns using a Euclidean distance function
  • Draws dendrograms on the rows and columns
  • Scales the data in the rows to have mean zero and standard deviation 1
  • Can alter this behavior with arguments:
heatmap(
  marker_matrix,
  Rowv=NA, # don't cluster rows
  Colv=NA, # don't cluster columns
  scale="none" # don't scale rows
)
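
The default steps can be reproduced by hand to see what happens before anything is drawn. This is a sketch on a small simulated matrix, not the function's internal code. It assumes the documented defaults of Euclidean distance from dist() and complete-linkage clustering from hclust().

```r
set.seed(4)
m <- matrix(rnorm(24, mean=10, sd=3), nrow=6,
            dimnames=list(paste0('G', 1:6), c('A','B','C','D')))

# hierarchical clustering of rows on Euclidean distances
row_clust <- hclust(dist(m))
row_clust$order # leaf order of the row dendrogram

# row scaling: scale() works on columns, so transpose, scale, transpose back
m_scaled <- t(scale(t(m)))
round(rowMeans(m_scaled), 10) # each row now has mean 0
apply(m_scaled, 1, sd)        # and standard deviation 1
```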

Less Fancy Heatmap

heatmap() Drawback

  • The scale mapping \(z\) values to colors is very important when interpreting heatmaps
  • heatmap() function has the major drawback that no color key is provided!
  • heatmap.2() in gplots package has a similar interface
  • Provides more parameters to control the behavior of the plot and includes a color key:
library(gplots)
heatmap.2(marker_matrix)

heatmap.2() Example

Annotating Rows and Columns

  • heatmap() and heatmap.2() can annotate rows and columns with a categorical variable along the margins
# with heatmap()
condition_colors <-
  transmute(
    ad_metadata,
    color=if_else(condition == "AD","red","blue")
  )
heatmap(
  marker_matrix,
  RowSideColors=condition_colors$color
)

Annotating Rows and Columns

Annotating Rows and Columns

# with heatmap.2()
heatmap.2(
  marker_matrix,
  RowSideColors=condition_colors$color
)

Annotating Rows and Columns

Heatmaps in ggplot

  • “Heatmaps” in ggplot use geom_tile geometry
  • geom_tile requires data in long format with x, y, and z values
pivot_longer(
  ad_metadata,
  c(tau,abeta,iba1,gfap),
  names_to="Marker",
  values_to="Intensity"
) %>%
  ggplot(aes(x=Marker,y=ID,fill=Intensity)) +
  geom_tile()

Heatmaps in ggplot

Native R color palettes

Specifying Heatmap Colors

# native R colors are:
# - rainbow(n, start=.7, end=.1)
# - heat.colors(n)
# - terrain.colors(n)
# - topo.colors(n)
# - cm.colors(n)
# the n argument specifies the number of colors (i.e. resolution) of the colormap to return
heatmap(marker_matrix,col=cm.colors(256))

Specifying Heatmap Colors

Heatmap Colors in ggplot

  • With ggplot and geom_tile(), use the scale_fill_gradientn function to specify a different color palette
pivot_longer(
  ad_metadata,
  c(tau,abeta,iba1,gfap),
  names_to="Marker",
  values_to="Intensity"
) %>%
  ggplot(aes(x=Marker,y=ID,fill=Intensity)) +
  geom_tile() +
  scale_fill_gradientn(colors=cm.colors(256))

Heatmap Colors in ggplot

Less ugly heatmap colors

How To Use Heatmaps Responsibly

  • Heatmaps are quite complicated and can easily mislead us

  • Four factors influence how a dataset can be visualized as a heatmap:

    1. The type of features, i.e. whether the features are continuous or discrete
    2. The relative scales of the features
    3. The total range of the data
    4. Whether or not the data are centered

Heatmaps: Discrete Features

Heatmaps: Continuous Features

  • Continuous features map numeric values to a color gradient

Heatmaps: Scales of Features

  • Generally, all features in heatmap must be on a comparable scale, or transformed appropriately to attain this
data <- tibble(
  ID=paste0('F',seq(10)),
  a=rnorm(10,0,1),
  b=rnorm(10,100,20),
  c=rnorm(10,20,5)
) %>%
  pivot_longer(c(a,b,c))

ggplot(data,aes(x=name,y=ID,fill=value)) +
  geom_tile()

Heatmaps: Scales of Features

Heatmaps: Scales of Features

library(ggbeeswarm)
ggplot(data) +
  geom_beeswarm(aes(x=name,y=value,color=name))

Heatmaps: Transform Features

  • Features with different ranges must be scaled so they are all comparable
  • e.g. z-transform
data %>%
  pivot_wider(id_cols='ID',names_from=name) %>%
  mutate(
    across(c(a,b,c),scale)
  ) %>%
  pivot_longer(c(a,b,c)) %>%
  ggplot(aes(x=name,y=ID,fill=value)) +
  geom_tile()

Heatmaps: Transform Features

Heatmaps: Transform Features

Heatmaps: Transform Features

  • Scaling rows vs columns dramatically changes the heatmap
data %>%
  pivot_wider(id_cols=name,names_from=ID) %>%
  mutate(
    across(starts_with('F'),scale)
  ) %>%
  pivot_longer(starts_with('F'),names_to="ID") %>%
  ggplot(aes(x=name,y=ID,fill=value)) +
  geom_tile()

Heatmaps: Transform Features

Heatmaps: Total data range

  • Extreme values, i.e. outliers, in any of the features can render a heatmap uninformative
tibble(
  ID=paste0('F',seq(10)),
  a=rnorm(10,0,1),
  b=rnorm(10,0,1),
  c=c(rnorm(9,0,1),1e9)
) %>%
  pivot_longer(c(a,b,c)) %>%
  ggplot(aes(x=name,y=ID,fill=value)) +
  geom_tile()

Heatmaps: Total data range

Heatmaps: Data Distribution

  • Heatmaps “work best” when data are normally distributed
  • If they are not:
data <- tibble(
  ID=paste0('F',seq(10)),
  a=10**rnorm(10,4,1),
  b=10**rnorm(10,4,1),
  c=10**rnorm(10,4,1)
) %>%
  pivot_longer(c(a,b,c))
data %>%
  ggplot(aes(x=name,y=ID,fill=value)) +
  geom_tile()
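
When values span several orders of magnitude, as in this simulation, a log transform before plotting is one common remedy. The sketch below regenerates comparable data (the fixed seed and the names skewed and g_log are additions for this example) and fills tiles by the log10 of each value:

```r
library(ggplot2)
library(tibble)
library(tidyr)
library(dplyr)

set.seed(5)
skewed <- tibble(
  ID=paste0('F', seq(10)),
  a=10**rnorm(10,4,1),
  b=10**rnorm(10,4,1),
  c=10**rnorm(10,4,1)
) %>%
  pivot_longer(c(a,b,c))

g_log <- skewed %>%
  mutate(log10_value=log10(value)) %>% # back onto a roughly normal scale
  ggplot(aes(x=name, y=ID, fill=log10_value)) +
  geom_tile()
g_log
```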

Heatmaps: Data Distribution

Heatmaps: Centered Data

  • Some features have a meaningful “center”
    • e.g. log2 fold change of 0 divides data into up and down
  • A diverging color palette appropriate in these cases
  • Be sure to align the center of the color palette with center value
    • e.g. midpoint=0 when central value is 0

Heatmaps: Centered Data

tibble(
  ID=paste0('F',seq(10)),
  a=rnorm(10,0,1),
  b=rnorm(10,0,1),
  c=rnorm(10,0,1)
) %>%
  pivot_longer(c(a,b,c)) %>%
  ggplot(aes(x=name,y=ID,fill=value)) +
  geom_tile() +
  scale_fill_gradient2(
    low="#000099", mid="#ffffff", high="#990000",
    midpoint=0
  )

Heatmaps: Centered Data

Heatmaps: Centered Data

  • These data are centered around 1, but the default midpoint centers the color palette on 0!
tibble(
  ID=paste0('F',seq(10)),
  a=rnorm(10,1,1),
  b=rnorm(10,1,1),
  c=rnorm(10,1,1)
) %>%
  pivot_longer(c(a,b,c)) %>%
  ggplot(aes(x=name,y=ID,fill=value)) +
  geom_tile() +
  scale_fill_gradient2(low="#000099", mid="#ffffff", high="#990000")

Heatmaps: Centered Data

Heatmaps: Centered Data

  • Be sure to set midpoint appropriately!
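
For data centered on 1, one fix is to pass the center to scale_fill_gradient2() explicitly. A sketch using the same simulated setup, with a seed added for reproducibility:

```r
library(ggplot2)
library(tibble)
library(tidyr)

set.seed(6)
g_centered <- tibble(
  ID=paste0('F', seq(10)),
  a=rnorm(10,1,1),
  b=rnorm(10,1,1),
  c=rnorm(10,1,1)
) %>%
  pivot_longer(c(a,b,c)) %>%
  ggplot(aes(x=name, y=ID, fill=value)) +
  geom_tile() +
  scale_fill_gradient2(
    low="#000099", mid="#ffffff", high="#990000",
    midpoint=1 # white now corresponds to the center of the data
  )
g_centered
```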

Heatmaps: 2-pole Data

Other Plot Types

Chord Diagrams and Circos Plots

  • Whole genome data can produce long vectors of data
    • e.g. read pileup across all chromosomes
  • Screens/publication formats have limited space
  • How to visualize an entire genome in a rectangular space?
  • How to depict relationships between disparate genomic loci?
  • Circos is a software package originally designed to handle this kind of data

Circos Plot Example

Circos Plot Example

Circos Plot Example

Circos Plot Example

circlize R package

  • Circos is written in Perl
  • circlize package provides R implementation

circlize R package

Multiple Plots

Multiple Plots

  • Often want to put multiple plots in same figure
  • Two approaches:
    • subplots: combine unrelated plots into one figure
    • facet wrapping: separate the same dataset into different plots

Multipanel Figures

  • patchwork library composes ggplot objects together using an intuitive set of operators:
    • a | b - put plots side-by-side
    • a / b - put plot a above b
    • (a | b) / c - put a and b side-by-side, and c below them

Multipanel Figures

data <- tibble(
  a=rnorm(100,0,1),
  b=rnorm(100,3,2)
)
g_scatter <- ggplot(data, aes(x=a, y=b)) +
  geom_point()
g_violin <- pivot_longer(data, c(a,b)) %>%
  ggplot(aes(x=name,y=value,fill=name)) +
  geom_violin()

Multipanel Figures

g_scatter | g_violin

Multipanel Figures

g_scatter / g_violin

Multipanel Figures

g_scatter / ( g_scatter | g_violin)

Multipanel Figures

(g_scatter / g_scatter ) | g_violin

Facet wrapping

  • Facet wrapping separates subsets of a dataset into plots with identical axes

Facet wrapping

library(mvtnorm) # package implementing multivariate normal distributions
nsamp <- 100
data <-  rbind(
    rmvnorm(nsamp,c(1,1),sigma=matrix(c(1,0.8,0.8,1),nrow=2)),
    rmvnorm(nsamp,c(1,1),sigma=matrix(c(1,-0.8,-0.8,1),nrow=2)),
    rmvnorm(nsamp,c(1,1),sigma=matrix(c(1,0,0,1),nrow=2))
)
colnames(data) <- c('x','y')
g_oneplot <- as_tibble(data) %>%
  mutate(
    sample_name=c(rep('A',nsamp),rep('B',nsamp),rep('C',nsamp))
  ) %>%
  ggplot(aes(x=x,y=y,color=sample_name)) +
  geom_point()
g_oneplot

Facet wrapping

facet_wrap() function

  • Plot may be split into three using the facet_wrap() function
g_oneplot + facet_wrap(vars(sample_name))

facet_wrap() function

  • Can apply transforms to each facet
g_oneplot + facet_wrap(vars(sample_name)) +
  geom_smooth(method="loess", formula=y ~ x)

Publication Ready Plots

Publication Ready Plots

  • ggplot styling is plain and recognizable
  • Some journals may have strict formatting requirements
  • Tweaking plot aesthetics in code can be very tedious
  • Can address these issues with
    • ggplot themes
    • Exporting to Scalable Vector Graphics (SVG) format

ggplot Themes

  • Themes in ggplot are combinations of styling elements applied to the different components of a chart
base_g <- tibble(
  x=rnorm(100),
  y=rnorm(100)
) %>%
  ggplot(aes(x=x, y=y)) +
  geom_point()

Default Theme

base_g

Black and white Theme

  • ggplot comes with other themes that may be added to plots with theme_X() functions
base_g + theme_bw()

Classic Theme

base_g + theme_classic()
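
Beyond the built-in themes, individual elements can be adjusted with theme() and labels set with labs(). The specific sizes and label text here are arbitrary choices for illustration:

```r
library(ggplot2)
library(tibble)

set.seed(7)
pts <- tibble(x=rnorm(100), y=rnorm(100))

styled_g <- ggplot(pts, aes(x=x, y=y)) +
  geom_point() +
  labs(x="x (simulated)", y="y (simulated)", title="Simulated points") +
  theme_classic() +
  theme(
    text=element_text(size=14),          # larger, more legible text
    plot.title=element_text(face="bold") # bold title
  )
styled_g
```

Calling theme() after theme_classic() matters: a complete theme added later would overwrite the individual adjustments.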

Exporting to SVG

  • All ggplot plots may be saved in Scalable Vector Graphics (SVG) format
  • SVG is an XML format that describes shapes using a mathematical specification
  • e.g. a circle:
<circle cx="50" cy="50" r="50"/>
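
Because SVG is plain text, a complete file can be written by hand. This sketch wraps the circle element above in a minimal svg root element and writes it to a temporary file. The 100 by 100 canvas size is an assumption chosen to fit the circle.

```r
svg_lines <- c(
  '<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">',
  '  <circle cx="50" cy="50" r="50"/>',
  '</svg>'
)

# write to a temporary file; open it in a browser or illustration program
svg_path <- tempfile(fileext=".svg")
writeLines(svg_lines, svg_path)

readLines(svg_path)
```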

Editing SVGs

  • Illustration programs can edit the individual elements of the entire plot
(g1 / g2) | g3

Exporting to SVG

  • Save to SVG format by using ggsave() function
ggsave('multipanel.svg')
## Saving 7.5 x 4.5 in image
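
Called with only a filename, ggsave() saves the last plot displayed at the current device size. Passing the plot and dimensions explicitly makes the export reproducible. Writing SVG from ggsave() relies on the svglite package, so this sketch checks for it first. The filename and dimensions are placeholders.

```r
library(ggplot2)
library(tibble)

set.seed(8)
p <- ggplot(tibble(x=rnorm(50), y=rnorm(50)), aes(x=x, y=y)) +
  geom_point()

out_file <- file.path(tempdir(), "scatter.svg") # placeholder path
if (requireNamespace("svglite", quietly=TRUE)) {
  # width and height are in inches by default
  ggsave(out_file, plot=p, width=7.5, height=4.5)
}
```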

Exporting to SVG