The Tidyverse

  • The tidyverse is “an opinionated collection of R packages designed for data science.”
  • packages are all designed to work together
  • tidyverse practically changes the R language into a data science language
  • tidyverse uses a distinct set of coding conventions that lets it achieve greater expressiveness, conciseness, and correctness relative to the base R language

Tidyverse Basics

  • tidyverse is a set of packages that work together

  • Load the most common tidyverse packages at the same time:

    library(tidyverse)
    -- Attaching packages ---------- tidyverse 1.3.1 --
    v ggplot2 3.3.5     v purrr   0.3.4
    v tibble  3.1.6     v dplyr   1.0.7
    v tidyr   1.1.4     v stringr 1.4.0
    v readr   2.1.1     v forcats 0.5.1
    -- Conflicts ------------- tidyverse_conflicts() --
    x dplyr::filter() masks stats::filter()
    x dplyr::lag()    masks stats::lag()

Default tidyverse packages

Package Description
ggplot2 Plotting using the grammar of graphics
tibble Simple and sane data frames
tidyr Operations for making data “tidy”
readr Read rectangular text data into tidyverse
purrr Functional programming tools for tidyverse
dplyr “A Grammar of Data Manipulation”
stringr Makes working with strings in R easier
forcats Operations for using categorical variables

Importing Data

  • readr package has functions to read in data from text files:
Function Brief description/use
read_csv Delimiter: , - Decimal separator: .
read_csv2 Delimiter: ; - Decimal separator: ,
read_tsv Delimiter: <tab> - Decimal separator: .
read_delim Delimiter: set by user - Decimal separator: .

Note readr functions operate on zipped files

  • Some CSV files can be very large and may be compressed
  • the most common compression formats in data science and biology are gzip and bzip.
  • All the readr file reading functions can read compressed files directly, so you do not need to decompress them firs

Writing tabular files

Function Brief description/use
write_csv Delimiter: , - Decimal separator: .
write_csv2 Delimiter: ; - Decimal separator: ,
write_tsv Delimiter: <tab> - Decimal separator: .
write_delim Delimiter: set by user - Decimal separator: .

The tibble

  • Data in tidyverse organized in a special data frame object called a tibble
  library(tibble)
  tbl <- tibble(
      x = rnorm(100, mean=20, sd=10),
      y = rnorm(100, mean=50, sd=5)
  )
  tbl
  # A tibble: 100 x 2
         x     y
      
   1 16.5   54.6
   2 14.4   54.3
   # ... with 98 more rows

The tibble is a data frame

  • A tibble stores rectangular data that can be accessed like a data frame:
  tbl$x
  [1] 29.57249 12.01577 15.25536 23.07761 32.25403 48.04651 21.90576
  [8] 15.51168 34.87285 21.32433 12.51230 23.60896  6.77630 12.34223
  ...
  tbl[1,"x"] # access the first element of x
  # A tibble: 1 x 1
      x
    
  1  29.6
  tbl$x[1]
  [1] 29.57255

tibble column names

  • tibbles (and regular data frames) typically have names for their columns accessed with the colnames function
  tbl <- tibble(
      x = rnorm(100, mean=20, sd=10),
      y = rnorm(100, mean=50, sd=5)
  )
  colnames(tbl)
  [1] "x" "y"

Changing tibble column names

  • Column names may be changed using this same function:
  colnames(tbl) <- c("a","b")
  tbl
  # A tibble: 100 x 2
         a     b
      
   1 16.5   54.6
   2 14.4   54.3
  # ... with 90 more rows
  • Can also use dplyr::rename to rename columns as well:
  dplyr::rename(tbl,
    a = x,
    b = y
  )

tibbles, data frames, and row names

  • tibbles and dataframes also have row names as well as column names:
  tbl <- tibble(
      x = rnorm(100, mean=20, sd=10),
      y = rnorm(100, mean=50, sd=5)
  )
  rownames(tbl)
  [1] "1" "2" "3"...
  • tibble support for row names is only included for compatibility with base R data frames
  • Authors of tidyverse believe row names are better stored as a normal column

tribble() - Constructing tibbles by hand

  • tibble package allows creating simple tibbles with the tribble() function:
gene_stats <- tribble(
    ~gene, ~test1_stat, ~test1_p, ~test2_stat, ~test2_p,
   "hoxd1", 12.509293,   0.1032,   34.239521,   1.3e-5,
   "brca1",  4.399211,   0.6323,   16.332318,   0.0421,
   "brca2",  9.672011,   0.9323,   12.689219,   0.0621,
   "snca", 45.748431,   4.2e-4,    0.757188,   0.9146,
)

Tidy Data

  • tidyverse packages designed to operate with so-called “tidy data”

  • the following rules make data tidy:

    1. Each variable must have its own column
    2. Each observation must have its own row
    3. Each value must have its own cell
  • a variable is a quantity or property that every observation in our dataset has

  • each observation is a separate instance of those variable

    • (e.g. a different sample, subject, etc)

Example of tidy data

  1. Each variable has its own column.
  2. Each observation has its own row.
  3. Each value has its own cell.
gene_stats
## # A tibble: 4 × 5
##   gene  test1_stat test1_p test2_stat  test2_p
##   <chr>      <dbl>   <dbl>      <dbl>    <dbl>
## 1 hoxd1      12.5  0.103       34.2   0.000013
## 2 brca1       4.40 0.632       16.3   0.0421  
## 3 brca2       9.67 0.932       12.7   0.0621  
## 4 snca       45.7  0.00042      0.757 0.915

Tidy data illustration

pipes with %>%

pipes

  • We often must perform serial operations on a data frame
  • For example:
    1. Read in a file
    2. Rename one of the columns
    3. Subset the rows based on some criteria
    4. Compute summary statistics on the result

Base R serial manipulations

  • In base R we would need to do this with assignments

    # data_file.csv has two columns: bad_cOlumn_name and numeric_column
    data <- readr::read_csv("data_file.csv")
    data <- dplyr::rename(data, "better_column_name"=bad_cOlumn_name)
    data <- dplyr::filter(data, better_column_name %in% c("condA","condB"))
    data_grouped <- dplyr::group_by(data, better_column_name)
    summarized <- dplyr::summarize(data_grouped, mean(numeric_column))

pipes with %>%

  • A key tidyverse programming pattern is chaining manipulations of tibbles together by passing the result of one manipulation as input to the next

  • Tidyverse defines the %>% operator to do this:

    data <- readr::read_csv("data_file.csv") %>%
          dplyr::rename("better_column_name"=bad_cOlumn_name) %>%
          dplyr::filter(better_column_name %in% c("condA","condB")) %>%
          dplyr::group_by(better_column_name) %>%
          dplyr::summarize(mean(numeric_column))
  • The %>% operator passes the result of the function immediately preceding it as the first argument to the next function automatically

Arranging Data

  • Often need to manipulate data in tibbles in various ways
  • Such manipulations might include:
    • Filtering out certain rows
    • Renaming poorly named columns
    • Deriving new columns using the values in others
    • Changing the order of rows etc.
  • These operations may collectively be termed arranging the data
  • Many provided in the *dplyr package

Arranging Data - dplyr::mutate()

dplyr::mutate() - Create new columns using other columns

Made up gene statistics example

  • Consider the following tibble with made up gene information
gene_stats
## # A tibble: 4 × 5
##   gene  test1_stat test1_p test2_stat  test2_p
##   <chr>      <dbl>   <dbl>      <dbl>    <dbl>
## 1 hoxd1      12.5  0.103       34.2   0.000013
## 2 brca1       4.40 0.632       16.3   0.0421  
## 3 brca2       9.67 0.932       12.7   0.0621  
## 4 snca       45.7  0.00042      0.757 0.915

dplyr::mutate() - FDR example

gene_stats <- dplyr::mutate(gene_stats,
  test1_padj=p.adjust(test1_p,method="fdr")
)
gene_stats
## # A tibble: 4 × 6
##   gene  test1_stat test1_p test2_stat  test2_p test1_padj
##   <chr>      <dbl>   <dbl>      <dbl>    <dbl>      <dbl>
## 1 hoxd1      12.5  0.103       34.2   0.000013    0.206  
## 2 brca1       4.40 0.632       16.3   0.0421      0.843  
## 3 brca2       9.67 0.932       12.7   0.0621      0.932  
## 4 snca       45.7  0.00042      0.757 0.915       0.00168

dplyr::mutate() - FDR example

  • You can create multiple columns in the same call to mutate():

    gene_stats <- dplyr::mutate(gene_stats,
      test1_padj=p.adjust(test1_p,method="fdr"),
      test2_padj=p.adjust(test2_p,method="fdr")
    )
    gene_stats
    ## # A tibble: 4 × 7
    ##   gene  test1_stat test1_p test2_stat  test2_p test1_padj test2_padj
    ##   <chr>      <dbl>   <dbl>      <dbl>    <dbl>      <dbl>      <dbl>
    ## 1 hoxd1      12.5  0.103       34.2   0.000013    0.206     0.000052
    ## 2 brca1       4.40 0.632       16.3   0.0421      0.843     0.0828  
    ## 3 brca2       9.67 0.932       12.7   0.0621      0.932     0.0828  
    ## 4 snca       45.7  0.00042      0.757 0.915       0.00168   0.915

dplyr::mutate() - Deriving from multiple columns

  • Here we create a column with TRUE or FALSE if either or both of the adjusted p-values are less than \(0.05\):

    gene_stats <- dplyr::mutate(gene_stats,
      signif_either=(test1_padj < 0.05 | test2_padj < 0.05),
      signif_both=(test1_padj < 0.05 & test2_padj < 0.05)
    )
  • | and & operators execute ‘or’ and ‘and’ logic, respectively

dplyr::mutate() - using derived columns

  • Columns created first in a mutate() call can be used in subsequent column definitions:
gene_stats <- dplyr::mutate(gene_stats,
  test1_padj=p.adjust(test1_p,method="fdr"), # test1_padj created
  test2_padj=p.adjust(test2_p,method="fdr"),
  signif_either=(test1_padj < 0.05 | test2_padj < 0.05), #test1_padj used
  signif_both=(test1_padj < 0.05 & test2_padj < 0.05)
)

dplyr::mutate() - Modifying columns

  • mutate() can also be used to modify columns in place

  • Example replaces the values in the gene column with upper cased values

    dplyr::mutate(gene_stats, gene=stringr::str_to_upper(gene) )
    ## # A tibble: 4 × 9
    ##   gene  test1_stat test1_p test2_stat  test2_p test1_padj test2_padj
    ##   <chr>      <dbl>   <dbl>      <dbl>    <dbl>      <dbl>      <dbl>
    ## 1 HOXD1      12.5  0.103       34.2   0.000013    0.206     0.000052
    ## 2 BRCA1       4.40 0.632       16.3   0.0421      0.843     0.0828  
    ## 3 BRCA2       9.67 0.932       12.7   0.0621      0.932     0.0828  
    ## 4 SNCA       45.7  0.00042      0.757 0.915       0.00168   0.915   
    ## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>

Arranging Data - dplyr::filter()

dplyr::filter() - Pick rows out of a data set

  • Consider the following tibble with made up gene information
gene_stats
## # A tibble: 4 × 9
##   gene  test1_stat test1_p test2_stat  test2_p test1_padj test2_padj
##   <chr>      <dbl>   <dbl>      <dbl>    <dbl>      <dbl>      <dbl>
## 1 HOXD1      12.5  0.103       34.2   0.000013    0.206     0.000052
## 2 BRCA1       4.40 0.632       16.3   0.0421      0.843     0.0828  
## 3 BRCA2       9.67 0.932       12.7   0.0621      0.932     0.0828  
## 4 SNCA       45.7  0.00042      0.757 0.915       0.00168   0.915   
## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>

dplyr::filter() - Filter based on text values

  • We can use filter() on the data frame to look for the genes BRCA1 and BRCA2
gene_stats %>% filter(gene == "BRCA1" | gene == "BRCA2")
## # A tibble: 2 × 9
##   gene  test1_stat test1_p test2_stat test2_p test1_padj test2_padj
##   <chr>      <dbl>   <dbl>      <dbl>   <dbl>      <dbl>      <dbl>
## 1 BRCA1       4.40   0.632       16.3  0.0421      0.843     0.0828
## 2 BRCA2       9.67   0.932       12.7  0.0621      0.932     0.0828
## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>

dplyr::filter() - Filter based on numeric values

  • We can use filter() to restrict genes to those that are significant at padj < 0.01
gene_stats %>% filter(test1_padj < 0.01)
## # A tibble: 1 × 9
##   gene  test1_stat test1_p test2_stat test2_p test1_padj test2_padj
##   <chr>      <dbl>   <dbl>      <dbl>   <dbl>      <dbl>      <dbl>
## 1 SNCA        45.7 0.00042      0.757   0.915    0.00168      0.915
## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>

stringr - Working with character values

stringr - Working with character values

  • Base R does not have very convenient functions for working with character strings
  • We must frequently manipulate strings while loading, cleaning, and analyzing datasets
  • The stringr package aims to make working with strings “as easy as possible.”

stringr - Working with character values

  • Package includes many useful functions for operating on strings:
    • searching for patterns
    • mutating strings
    • lexicographical sorting
    • concatenation
    • complex search/replace operations
  • stringr documentation and the very helpful stringr cheatsheet.

stringr Example: upper case

  • The function stringr::str_to_upper() with the dplyr::mutate() function to cast an existing column to upper case

    dplyr::mutate(gene_stats,
      gene=stringr::str_to_upper(gene)
    )
    ## # A tibble: 4 × 9
    ##   gene  test1_stat test1_p test2_stat  test2_p test1_padj test2_padj
    ##   <chr>      <dbl>   <dbl>      <dbl>    <dbl>      <dbl>      <dbl>
    ## 1 HOXD1      12.5  0.103       34.2   0.000013    0.206     0.000052
    ## 2 BRCA1       4.40 0.632       16.3   0.0421      0.843     0.0828  
    ## 3 BRCA2       9.67 0.932       12.7   0.0621      0.932     0.0828  
    ## 4 SNCA       45.7  0.00042      0.757 0.915       0.00168   0.915   
    ## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>

Regular expressions

  • Many operations in stringr package use regular expression syntax
  • A regular expression is a “mini” programming language that describes patterns in text
  • Certain characters have special meaning that help in defining search patterns that identifies the location of sequences of characters in text
  • Similar to but more powerful than “Find” functionality in many word processors

Regular expression example

  • Consider the tibble with made up gene information:
gene_stats
## # A tibble: 4 × 9
##   gene  test1_stat test1_p test2_stat  test2_p test1_padj test2_padj
##   <chr>      <dbl>   <dbl>      <dbl>    <dbl>      <dbl>      <dbl>
## 1 HOXD1      12.5  0.103       34.2   0.000013    0.206     0.000052
## 2 BRCA1       4.40 0.632       16.3   0.0421      0.843     0.0828  
## 3 BRCA2       9.67 0.932       12.7   0.0621      0.932     0.0828  
## 4 SNCA       45.7  0.00042      0.757 0.915       0.00168   0.915   
## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>
  • Used filter() to look for literal “BRCA1” and “BRCA2”
  • Names follow a pattern of BRCAX, where X is 1 or 2

Regular expression example

stringr::str_detect(c("HOX1A","BRCA1","BRCA2","SNCA"), "^BRCA[12]$")
## [1] FALSE  TRUE  TRUE FALSE
dplyr::filter(gene_stats, str_detect(gene,"^BRCA[12]$"))
## # A tibble: 2 × 9
##   gene  test1_stat test1_p test2_stat test2_p test1_padj test2_padj
##   <chr>      <dbl>   <dbl>      <dbl>   <dbl>      <dbl>      <dbl>
## 1 BRCA1       4.40   0.632       16.3  0.0421      0.843     0.0828
## 2 BRCA2       9.67   0.932       12.7  0.0621      0.932     0.0828
## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>

Regular expression example

dplyr::filter(gene_stats, str_detect(gene,"^BRCA[12]$"))
  • The argument "^BRCA[12]$" is a regular expression that searches for the following:

    • Gene name starts with BRCA (^BRCA)
    • Of those, include genes only if followed by either 1 or 2 ([12])
    • Of those, match successfully if the number is last character ($)

Regular expression syntax

  • Regular expression syntax has certain characters with special meaning:
    • . - match any single character
    • * - match zero or more of the character immediately preceding the * "", "1", "11", "111", ...
    • + - match one or more of the character immediately preceding the * "1", "11", "111", ...
    • [XYZ] - match one of any of the characters between [] (e.g. X, Y, or Z)
    • ^, $ - match the beginning and end of the string, respectively
  • There are more special characters as well, good tutorial: RegexOne - regular expression tutorial

Arranging Data - dplyr::select()

dplyr::select() - Subset Columns by Name

stats <- dplyr::select(gene_stats, test1_stat, test2_stat)
stats
## # A tibble: 4 × 2
##   test1_stat test2_stat
##        <dbl>      <dbl>
## 1      12.5      34.2  
## 2       4.40     16.3  
## 3       9.67     12.7  
## 4      45.7       0.757

dplyr::select() - Helper functions

  • dplyr also has “helper functions” for more flexible selection of columns
  • For example, if all of the columns we wished to select ended with _stat, we could use the ends_with() helper function:
stats <- dplyr::select(gene_stats, ends_with("_stat"))
stats
## # A tibble: 4 × 2
##   test1_stat test2_stat
##        <dbl>      <dbl>
## 1      12.5      34.2  
## 2       4.40     16.3  
## 3       9.67     12.7  
## 4      45.7       0.757

dplyr::select() - Renaming columns

  • select() allows for the renaming of selected columns:
stats <- dplyr::select(gene_stats,
  t=test1_stat,
  chisq=test2_stat
)
stats
## # A tibble: 4 × 2
##       t  chisq
##   <dbl>  <dbl>
## 1 12.5  34.2  
## 2  4.40 16.3  
## 3  9.67 12.7  
## 4 45.7   0.757

dplyr::select() - Reorder columns

dplyr::select(gene_stats,
    gene,
    test1_stat, test1_p, test1_padj,
    test2_stat, test2_p, test2_padj,
    signif_either,
    signif_both
)
## # A tibble: 4 × 9
##   gene  test1_stat test1_p test1_padj test2_stat  test2_p test2_padj
##   <chr>      <dbl>   <dbl>      <dbl>      <dbl>    <dbl>      <dbl>
## 1 HOXD1      12.5  0.103      0.206       34.2   0.000013   0.000052
## 2 BRCA1       4.40 0.632      0.843       16.3   0.0421     0.0828  
## 3 BRCA2       9.67 0.932      0.932       12.7   0.0621     0.0828  
## 4 SNCA       45.7  0.00042    0.00168      0.757 0.915      0.915   
## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>

Arranging Data - dplyr::arrange()

dplyr::arrange() - Order rows based on their values

  • Sometimes want to reorder rows, e.g. by p-value
dplyr::arrange(gene_stats, test1_p)
## # A tibble: 4 × 9
##   gene  test1_stat test1_p test2_stat  test2_p test1_padj test2_padj
##   <chr>      <dbl>   <dbl>      <dbl>    <dbl>      <dbl>      <dbl>
## 1 SNCA       45.7  0.00042      0.757 0.915       0.00168   0.915   
## 2 HOXD1      12.5  0.103       34.2   0.000013    0.206     0.000052
## 3 BRCA1       4.40 0.632       16.3   0.0421      0.843     0.0828  
## 4 BRCA2       9.67 0.932       12.7   0.0621      0.932     0.0828  
## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>

dplyr::arrange() - Changing sort direction

  • dplyr::arrange() sort ascending by default
  • Can change order to descending with desc() helper function
# desc() sorts descending
dplyr::arrange(gene_stats, desc(abs(test1_stat)))
## # A tibble: 4 × 9
##   gene  test1_stat test1_p test2_stat  test2_p test1_padj test2_padj
##   <chr>      <dbl>   <dbl>      <dbl>    <dbl>      <dbl>      <dbl>
## 1 SNCA       45.7  0.00042      0.757 0.915       0.00168   0.915   
## 2 HOXD1      12.5  0.103       34.2   0.000013    0.206     0.000052
## 3 BRCA2       9.67 0.932       12.7   0.0621      0.932     0.0828  
## 4 BRCA1       4.40 0.632       16.3   0.0421      0.843     0.0828  
## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>

Putting it all together

In the previous sections, we performed the following operations:

  1. Created new FDR on the nominal p-values using the dplyr::mutate() and p.adjust functions
  2. Created new boolean significance for each gene using dplyr::mutate()
  3. Mutated the gene symbol case using stringr::str_to_upper and dplyr::mutate()
  4. Reordered the columns to group related variables with select()
  5. Filtered genes by FDR < 0.05 for either and both statistical tests using dplyr::filter()
  6. Sorted the results by p-value using dplyr::arrange()

Putting it all together

gene_stats <- dplyr::mutate(gene_stats,
  test1_padj=p.adjust(test1_p,method="fdr"),
  test2_padj=p.adjust(test2_p,method="fdr"),
  signif_either=(test1_padj < 0.05 | test2_padj < 0.05),
  signif_both=(test1_padj < 0.05 & test2_padj < 0.05),
  gene=stringr::str_to_upper(gene)
) %>%
dplyr::select(
    gene,
    test1_stat, test1_p, test1_padj,
    test2_stat, test2_p, test2_padj,
    signif_either,
    signif_both
) %>%
dplyr::filter( test1_padj < 0.05 | test2_padj < 0.05 ) %>%
dplyr::arrange( test1_p )

Grouping Data

Grouping Data

  • Grouping data together allow us to summarize them

  • Consider:

    metadata <- tribble(
        ~ID, ~condition, ~age_at_death, ~Braak_stage, ~APOE_genotype,
      "A01",        "AD",            78,            5,       "e4/e4",
      "A02",        "AD",            81,            6,       "e3/e4",
      "A03",        "AD",            90,            5,       "e4/e4",
      "A04",   "Control",            80,            1,       "e3/e4",
      "A05",   "Control",            79,            0,       "e3/e3",
      "A06",   "Control",            81,            0,       "e2/e3"
    )

Summarizing groups of data

metadata
## # A tibble: 6 × 5
##   ID    condition age_at_death Braak_stage APOE_genotype
##   <chr> <chr>            <dbl>       <dbl> <chr>        
## 1 A01   AD                  78           5 e4/e4        
## 2 A02   AD                  81           6 e3/e4        
## 3 A03   AD                  90           5 e4/e4        
## 4 A04   Control             80           1 e3/e4        
## 5 A05   Control             79           0 e3/e3        
## 6 A06   Control             81           0 e2/e3
  • How many samples of each condition are there?
  • Are the age at death/Braak stage distributions similar?
  • Must group samples together based on condition to answer these kinds of questions

Grouping with dplyr::group_by()

  • dplyr::group_by() groups rows together based on condition

    dplyr::group_by(metadata,
      condition
    )
    ## # A tibble: 6 × 5
    ## # Groups:   condition [2]
    ##   ID    condition age_at_death Braak_stage APOE_genotype
    ##   <chr> <chr>            <dbl>       <dbl> <chr>        
    ## 1 A01   AD                  78           5 e4/e4        
    ## 2 A02   AD                  81           6 e3/e4        
    ## 3 A03   AD                  90           5 e4/e4        
    ## 4 A04   Control             80           1 e3/e4        
    ## 5 A05   Control             79           0 e3/e3        
    ## 6 A06   Control             81           0 e2/e3

Summarizing with dplyr::summarize()

  • dplyr::summarize() (or summarise()) to compute the mean age at death for each group:

    dplyr::group_by(metadata,
      condition
    ) %>% dplyr::summarize(mean_age_at_death = mean(age_at_death))
    ## # A tibble: 2 × 2
    ##   condition mean_age_at_death
    ##   <chr>                 <dbl>
    ## 1 AD                       83
    ## 2 Control                  80

Summarizing more

dplyr::group_by(metadata,
  condition
) %>% dplyr::summarize(
  mean_age_at_death = mean(age_at_death),
  sd_age_at_death = sd(age_at_death),
  lower_age = mean_age_at_death-sd_age_at_death,
  upper_age = mean_age_at_death+sd_age_at_death,
)
## # A tibble: 2 × 5
##   condition mean_age_at_death sd_age_at_death lower_age upper_age
##   <chr>                 <dbl>           <dbl>     <dbl>     <dbl>
## 1 AD                       83            6.24      76.8      89.2
## 2 Control                  80            1         79        81

Helper functions with summarize()

  • dplyr::summarize() has some helper functions
  • n() provides the number of rows each group has
dplyr::group_by(metadata,
  condition
) %>% dplyr::summarize(
  num_subjects = n(),
  mean_age_at_death = mean(age_at_death),
)
## # A tibble: 2 × 3
##   condition num_subjects mean_age_at_death
##   <chr>            <int>             <dbl>
## 1 AD                   3                83
## 2 Control              3                80

Rearranging Data

Rearranging Data

  • Sometimes the shape and format of our data doesn’t let us summarize easily

  • Consider our previous example:

    gene_stats <- tribble(
        ~gene, ~test1_stat, ~test1_p, ~test2_stat, ~test2_p,
       "APOE",   12.509293,   0.1032,   34.239521,   1.3e-5,
      "HOXD1",    4.399211,   0.6323,   16.332318,   0.0421,
       "SNCA",   45.748431,   4.2e-9,    0.757188,   0.9146,
    )

Rearranging Data

  • What are the mean and range of our statistic columns?

  • Could do this manually like so:

    tribble(
       ~test_name, ~min, ~mean, ~max,
       "test1_stat", min(gene_stats$test1_stat),
           mean(gene_stats$test1_stat), max(gene_stats$test1_stat),
       "test2_stat",  min(gene_stats$test2_stat),
           mean(gene_stats$test2_stat), max(gene_stats$test2_stat),
    )
    ## # A tibble: 2 × 4
    ##   test_name    min  mean   max
    ##   <chr>      <dbl> <dbl> <dbl>
    ## 1 test1_stat 4.40   20.9  45.7
    ## 2 test2_stat 0.757  17.1  34.2

Summarizing by column

  • In previous example, we were summarizing each statistic column
  • group_by() and summarize() summarize rows!
  • We can pivot the table so columns become rows with tidyr::pivot_longer()

tidyr::pivot_longer()

tidyr::pivot_longer(
  gene_stats,
  c(test1_stat, test2_stat), # columns to pivot
  names_to="test",
  values_to="stat"
)
## # A tibble: 6 × 5
##   gene       test1_p  test2_p test         stat
##   <chr>        <dbl>    <dbl> <chr>       <dbl>
## 1 APOE  0.103        0.000013 test1_stat 12.5  
## 2 APOE  0.103        0.000013 test2_stat 34.2  
## 3 HOXD1 0.632        0.0421   test1_stat  4.40 
## 4 HOXD1 0.632        0.0421   test2_stat 16.3  
## 5 SNCA  0.0000000042 0.915    test1_stat 45.7  
## 6 SNCA  0.0000000042 0.915    test2_stat  0.757

Pivot longer illustration

tidyr::pivot_longer()

long_gene_stats <- tidyr::pivot_longer(
  gene_stats,
  ends_with("_stat"), # was c(test1_stat, test2_stat),
  names_to="test",
  values_to="stat"
)
## # A tibble: 6 × 5
##   gene       test1_p  test2_p test         stat
##   <chr>        <dbl>    <dbl> <chr>       <dbl>
## 1 APOE  0.103        0.000013 test1_stat 12.5  
## 2 APOE  0.103        0.000013 test2_stat 34.2  
## 3 HOXD1 0.632        0.0421   test1_stat  4.40 
## 4 HOXD1 0.632        0.0421   test2_stat 16.3  
## 5 SNCA  0.0000000042 0.915    test1_stat 45.7  
## 6 SNCA  0.0000000042 0.915    test2_stat  0.757

Pivot, Group, Summarize

  • To summarize, group_by() on the test column and summarize() on the stat column:
long_gene_stats %>%
  dplyr::group_by(test) %>%
  dplyr::summarize(
    min = min(stat), mean = mean(stat), max = max(stat)
  )
## # A tibble: 2 × 4
##   test         min  mean   max
##   <chr>      <dbl> <dbl> <dbl>
## 1 test1_stat 4.40   20.9  45.7
## 2 test2_stat 0.757  17.1  34.2

Note: tidyr::pivot_wider()

  • Inverse of pivot_longer() is pivot_wider()
  • If you have variables gathered in single columns like that produced by pivot_longer(), reverses process to create tibble with those variables as columns
  • Can “reorganize” tibble by first pivot_longer() followed by pivot_wider()

Pivot wider illustration

# Relational Data

Relational Data

  • Often need to combine different sources of data together to aid in interpretation of results
  • Consider our fake gene statistics tibble:
tribble(
    ~gene, ~test1_stat, ~test1_p, ~test2_stat, ~test2_p,
   "APOE",   12.509293,   0.1032,   34.239521,   1.3e-5,
  "HOXD1",    4.399211,   0.6323,   16.332318,   0.0421,
   "SNCA",   45.748431,   4.2e-9,    0.757188,   0.9146,
)
## # A tibble: 3 × 5
##   gene  test1_stat      test1_p test2_stat  test2_p
##   <chr>      <dbl>        <dbl>      <dbl>    <dbl>
## 1 APOE       12.5  0.103            34.2   0.000013
## 2 HOXD1       4.40 0.632            16.3   0.0421  
## 3 SNCA       45.7  0.0000000042      0.757 0.915

Match rows between tibbles

  • gene column in tibble is a gene symbol

  • Other gene IDs exist for the same gene, e.g. Ensembl Gene IDs, that are associated with other information

  • Need to match up gene symbol with corresponding Ensembl Gene ID using a known mapping:

    gene_map <- tribble(
        ~symbol,            ~ENSGID,                    ~gene_name,
         "APOE",  "ENSG00000130203",            "apolipoprotein E",
        "BRCA1",  "ENSG00000012048", "BRCA1 DNA repair associated",
        "HOXD1",  "ENSG00000128645",                 "homeobox D1",
         "SNCA",  "ENSG00000145335",             "synuclein alpha",
    )

Match rows between tibbles

## # A tibble: 4 × 3
##   symbol ENSGID          gene_name                  
##   <chr>  <chr>           <chr>                      
## 1 APOE   ENSG00000130203 apolipoprotein E           
## 2 BRCA1  ENSG00000012048 BRCA1 DNA repair associated
## 3 HOXD1  ENSG00000128645 homeobox D1                
## 4 SNCA   ENSG00000145335 synuclein alpha

Match rows between tibbles

  • Term for matching up rows of two tibbles (tables) called joining

  • We join the gene_stats tibble with the gene_map tibble

  • Match the gene column with the symbol column

  • dplyr::left_join() function accepts two data frames and the names of columns in each that share common values:

    dplyr::left_join(
        x=gene_stats,
        y=gene_map,
        by=c("gene" = "symbol")
    )

Match rows between tibbles

## # A tibble: 3 × 7
##   gene  test1_stat      test1_p test2_stat  test2_p ENSGID          gene_name   
##   <chr>      <dbl>        <dbl>      <dbl>    <dbl> <chr>           <chr>       
## 1 APOE       12.5  0.103            34.2   0.000013 ENSG00000130203 apolipoprot…
## 2 HOXD1       4.40 0.632            16.3   0.0421   ENSG00000128645 homeobox D1 
## 3 SNCA       45.7  0.0000000042      0.757 0.915    ENSG00000145335 synuclein a…

Match rows between tibbles

  • Same thing with pipe:

    gene_stats %>% dplyr::left_join(
      gene_map,
      by=c("gene" = "symbol")
    )
    ## # A tibble: 3 × 7
    ##   gene  test1_stat      test1_p test2_stat  test2_p ENSGID          gene_name   
    ##   <chr>      <dbl>        <dbl>      <dbl>    <dbl> <chr>           <chr>       
    ## 1 APOE       12.5  0.103            34.2   0.000013 ENSG00000130203 apolipoprot…
    ## 2 HOXD1       4.40 0.632            16.3   0.0421   ENSG00000128645 homeobox D1 
    ## 3 SNCA       45.7  0.0000000042      0.757 0.915    ENSG00000145335 synuclein a…

Missing values in join

  • The “direction” of the join determines which values appear in output

  • Left join means include all rows in “left” tibble even if there is no match

  • Note BRCA1 is in gene_map but not gene_stats:

    dplyr::left_join(
      gene_map, # gene_map is "left"
      gene_stats, # gene_stats is "right"
      by=c("symbol" = "gene")
    )

Missing values in join

## # A tibble: 4 × 7
##   symbol ENSGID          gene_name       test1_stat  test1_p test2_stat  test2_p
##   <chr>  <chr>           <chr>                <dbl>    <dbl>      <dbl>    <dbl>
## 1 APOE   ENSG00000130203 apolipoprotein…      12.5   1.03e-1     34.2    1.3 e-5
## 2 BRCA1  ENSG00000012048 BRCA1 DNA repa…      NA    NA           NA     NA      
## 3 HOXD1  ENSG00000128645 homeobox D1           4.40  6.32e-1     16.3    4.21e-2
## 4 SNCA   ENSG00000145335 synuclein alpha      45.7   4.20e-9      0.757  9.15e-1

Join directions

  • “Right joins” are simply the opposite of left joins:

    dplyr::right_join(
      gene_stats, # gene_stats is "left"
    gene_map, # gene_map is "right"
      by=c("gene" = "symbol")
    )
    ## # A tibble: 4 × 7
    ##   gene  test1_stat       test1_p test2_stat   test2_p ENSGID          gene_name 
    ##   <chr>      <dbl>         <dbl>      <dbl>     <dbl> <chr>           <chr>     
    ## 1 APOE       12.5   0.103            34.2    0.000013 ENSG00000130203 apolipopr…
    ## 2 HOXD1       4.40  0.632            16.3    0.0421   ENSG00000128645 homeobox …
    ## 3 SNCA       45.7   0.0000000042      0.757  0.915    ENSG00000145335 synuclein…
    ## 4 BRCA1      NA    NA                NA     NA        ENSG00000012048 BRCA1 DNA…

Inner join

  • Inner joins return results only for rows that have a match between the two tibbles

  • Recall gene_map included BRCA1 even though it was not found in gene_stats

  • Inner join will not include this row, because no match in gene_stats was found:

    dplyr::inner_join(
      gene_map,
      gene_stats,
      by=c("symbol" = "gene")
    )

Inner join

## # A tibble: 3 × 7
##   symbol ENSGID          gene_name        test1_stat  test1_p test2_stat test2_p
##   <chr>  <chr>           <chr>                 <dbl>    <dbl>      <dbl>   <dbl>
## 1 APOE   ENSG00000130203 apolipoprotein E      12.5   1.03e-1     34.2   1.3 e-5
## 2 HOXD1  ENSG00000128645 homeobox D1            4.40  6.32e-1     16.3   4.21e-2
## 3 SNCA   ENSG00000145335 synuclein alpha       45.7   4.20e-9      0.757 9.15e-1

Full or outer join

  • dplyr::full_join() (also sometimes called an outer join)
  • a full join will return all rows from both tibbles whether a match in the other table was found or not

Multiple matches

  • Sometimes one table will have multiple rows that match the values in the other
  • Read the textbook chapter on how to handle