Data Wrangling

The Tidyverse

The tidyverse is “an opinionated collection of R packages designed for data science.”
packages are all designed to work together
tidyverse practically changes the R language into a data science language
tidyverse uses a distinct set of coding conventions that lets it achieve greater expressiveness, conciseness, and correctness relative to the base R language

Tidyverse Basics

tidyverse is a set of packages that work together

Load the most common tidyverse packages at the same time:

library(tidyverse)
-- Attaching packages ---------- tidyverse 1.3.1 --
v ggplot2 3.3.5     v purrr   0.3.4
v tibble  3.1.6     v dplyr   1.0.7
v tidyr   1.1.4     v stringr 1.4.0
v readr   2.1.1     v forcats 0.5.1
-- Conflicts ------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

Default tidyverse packages

Package	Description
ggplot2	Plotting using the grammar of graphics
tibble	Simple and sane data frames
tidyr	Operations for making data “tidy”
readr	Read rectangular text data into tidyverse
purrr	Functional programming tools for tidyverse
dplyr	“A Grammar of Data Manipulation”
stringr	Makes working with strings in R easier
forcats	Operations for using categorical variables

Importing Data

readr package has functions to read in data from text files:

Function	Brief description/use
`read_csv`	Delimiter: `,` - Decimal separator: `.`
`read_csv2`	Delimiter: `;` - Decimal separator: `,`
`read_tsv`	Delimiter: `<tab>` - Decimal separator: `.`
`read_delim`	Delimiter: set by user - Decimal separator: `.`

Note `readr` functions operate on zipped files

Some CSV files can be very large and may be compressed
the most common compression formats in data science and biology are gzip and bzip.
All the readr file reading functions can read compressed files directly, so you do not need to decompress them firs

Writing tabular files

Note that readr also has functions for writing delimited files:

Function	Brief description/use
`write_csv`	Delimiter: `,` - Decimal separator: `.`
`write_csv2`	Delimiter: `;` - Decimal separator: `,`
`write_tsv`	Delimiter: `<tab>` - Decimal separator: `.`
`write_delim`	Delimiter: set by user - Decimal separator: `.`

The tibble

Data in tidyverse organized in a special data frame object called a tibble

  library(tibble)
  tbl <- tibble(
      x = rnorm(100, mean=20, sd=10),
      y = rnorm(100, mean=50, sd=5)
  )
  tbl
  # A tibble: 100 x 2
         x     y
      
   1 16.5   54.6
   2 14.4   54.3
   # ... with 98 more rows

The tibble is a data frame

A tibble stores rectangular data that can be accessed like a data frame:

  tbl$x
  [1] 29.57249 12.01577 15.25536 23.07761 32.25403 48.04651 21.90576
  [8] 15.51168 34.87285 21.32433 12.51230 23.60896  6.77630 12.34223
  ...
  tbl[1,"x"] # access the first element of x
  # A tibble: 1 x 1
      x
    
  1  29.6
  tbl$x[1]
  [1] 29.57255

tibble column names

tibbles (and regular data frames) typically have names for their columns accessed with the colnames function

  tbl <- tibble(
      x = rnorm(100, mean=20, sd=10),
      y = rnorm(100, mean=50, sd=5)
  )
  colnames(tbl)
  [1] "x" "y"

Changing tibble column names

Column names may be changed using this same function:

  colnames(tbl) <- c("a","b")
  tbl
  # A tibble: 100 x 2
         a     b
      
   1 16.5   54.6
   2 14.4   54.3
  # ... with 90 more rows

Can also use dplyr::rename to rename columns as well:

  dplyr::rename(tbl,
    a = x,
    b = y
  )

tibbles, data frames, and row names

tibbles and dataframes also have row names as well as column names:

  tbl <- tibble(
      x = rnorm(100, mean=20, sd=10),
      y = rnorm(100, mean=50, sd=5)
  )
  rownames(tbl)
  [1] "1" "2" "3"...

tibble support for row names is only included for compatibility with base R data frames
Authors of tidyverse believe row names are better stored as a normal column

`tribble()` - Constructing tibbles by hand

tibble package allows creating simple tibbles with the tribble() function:

gene_stats <- tribble(
    ~gene, ~test1_stat, ~test1_p, ~test2_stat, ~test2_p,
   "hoxd1", 12.509293,   0.1032,   34.239521,   1.3e-5,
   "brca1",  4.399211,   0.6323,   16.332318,   0.0421,
   "brca2",  9.672011,   0.9323,   12.689219,   0.0621,
   "snca", 45.748431,   4.2e-4,    0.757188,   0.9146,
)

Tidy Data

tidyverse packages designed to operate with so-called “tidy data”
the following rules make data tidy:
1. Each variable must have its own column
2. Each observation must have its own row
3. Each value must have its own cell
a variable is a quantity or property that every observation in our dataset has
each observation is a separate instance of those variable
- (e.g. a different sample, subject, etc)

Example of tidy data

Each variable has its own column.
Each observation has its own row.
Each value has its own cell.

gene_stats

## # A tibble: 4 × 5
##   gene  test1_stat test1_p test2_stat  test2_p
##   <chr>      <dbl>   <dbl>      <dbl>    <dbl>
## 1 hoxd1      12.5  0.103       34.2   0.000013
## 2 brca1       4.40 0.632       16.3   0.0421  
## 3 brca2       9.67 0.932       12.7   0.0621  
## 4 snca       45.7  0.00042      0.757 0.915

Tidy data illustration

pipes with `%>%`

pipes

We often must perform serial operations on a data frame
For example:
1. Read in a file
2. Rename one of the columns
3. Subset the rows based on some criteria
4. Compute summary statistics on the result

Base R serial manipulations

In base R we would need to do this with assignments

# data_file.csv has two columns: bad_cOlumn_name and numeric_column
data <- readr::read_csv("data_file.csv")
data <- dplyr::rename(data, "better_column_name"=bad_cOlumn_name)
data <- dplyr::filter(data, better_column_name %in% c("condA","condB"))
data_grouped <- dplyr::group_by(data, better_column_name)
summarized <- dplyr::summarize(data_grouped, mean(numeric_column))

pipes with `%>%`

A key tidyverse programming pattern is chaining manipulations of tibbles together by passing the result of one manipulation as input to the next

Tidyverse defines the %>% operator to do this:

data <- readr::read_csv("data_file.csv") %>%
      dplyr::rename("better_column_name"=bad_cOlumn_name) %>%
      dplyr::filter(better_column_name %in% c("condA","condB")) %>%
      dplyr::group_by(better_column_name) %>%
      dplyr::summarize(mean(numeric_column))

The %>% operator passes the result of the function immediately preceding it as the first argument to the next function automatically

Arranging Data

Often need to manipulate data in tibbles in various ways
Such manipulations might include:
- Filtering out certain rows
- Renaming poorly named columns
- Deriving new columns using the values in others
- Changing the order of rows etc.
These operations may collectively be termed arranging the data
Many provided in the *dplyr package

Arranging Data - `dplyr::mutate()`

`dplyr::mutate()` - Create new columns using other columns

Sometimes we want to create new columns derived from other columns
Example: Multiple testing correction
Adjusts nominal p-values to account for the number of tests that are significant simply bu chance
Benjamini-Hochberg or False Discovery Rate (FDR) procedure, a common procedure
p.adjust function can perform several of these procedures, including FDR

Made up gene statistics example

Consider the following tibble with made up gene information

gene_stats

## # A tibble: 4 × 5
##   gene  test1_stat test1_p test2_stat  test2_p
##   <chr>      <dbl>   <dbl>      <dbl>    <dbl>
## 1 hoxd1      12.5  0.103       34.2   0.000013
## 2 brca1       4.40 0.632       16.3   0.0421  
## 3 brca2       9.67 0.932       12.7   0.0621  
## 4 snca       45.7  0.00042      0.757 0.915

`dplyr::mutate()` - FDR example

gene_stats <- dplyr::mutate(gene_stats,
  test1_padj=p.adjust(test1_p,method="fdr")
)
gene_stats

## # A tibble: 4 × 6
##   gene  test1_stat test1_p test2_stat  test2_p test1_padj
##   <chr>      <dbl>   <dbl>      <dbl>    <dbl>      <dbl>
## 1 hoxd1      12.5  0.103       34.2   0.000013    0.206  
## 2 brca1       4.40 0.632       16.3   0.0421      0.843  
## 3 brca2       9.67 0.932       12.7   0.0621      0.932  
## 4 snca       45.7  0.00042      0.757 0.915       0.00168

`dplyr::mutate()` - FDR example

You can create multiple columns in the same call to mutate():

gene_stats <- dplyr::mutate(gene_stats,
  test1_padj=p.adjust(test1_p,method="fdr"),
  test2_padj=p.adjust(test2_p,method="fdr")
)
gene_stats

## # A tibble: 4 × 7
##   gene  test1_stat test1_p test2_stat  test2_p test1_padj test2_padj
##   <chr>      <dbl>   <dbl>      <dbl>    <dbl>      <dbl>      <dbl>
## 1 hoxd1      12.5  0.103       34.2   0.000013    0.206     0.000052
## 2 brca1       4.40 0.632       16.3   0.0421      0.843     0.0828  
## 3 brca2       9.67 0.932       12.7   0.0621      0.932     0.0828  
## 4 snca       45.7  0.00042      0.757 0.915       0.00168   0.915

`dplyr::mutate()` - Deriving from multiple columns

Here we create a column with TRUE or FALSE if either or both of the adjusted p-values are less than $0.05$:

gene_stats <- dplyr::mutate(gene_stats,
  signif_either=(test1_padj < 0.05 | test2_padj < 0.05),
  signif_both=(test1_padj < 0.05 & test2_padj < 0.05)
)

| and & operators execute ‘or’ and ‘and’ logic, respectively

`dplyr::mutate()` - using derived columns

Columns created first in a mutate() call can be used in subsequent column definitions:

gene_stats <- dplyr::mutate(gene_stats,
  test1_padj=p.adjust(test1_p,method="fdr"), # test1_padj created
  test2_padj=p.adjust(test2_p,method="fdr"),
  signif_either=(test1_padj < 0.05 | test2_padj < 0.05), #test1_padj used
  signif_both=(test1_padj < 0.05 & test2_padj < 0.05)
)

`dplyr::mutate()` - Modifying columns

mutate() can also be used to modify columns in place

Example replaces the values in the gene column with upper cased values

dplyr::mutate(gene_stats, gene=stringr::str_to_upper(gene) )

## # A tibble: 4 × 9
##   gene  test1_stat test1_p test2_stat  test2_p test1_padj test2_padj
##   <chr>      <dbl>   <dbl>      <dbl>    <dbl>      <dbl>      <dbl>
## 1 HOXD1      12.5  0.103       34.2   0.000013    0.206     0.000052
## 2 BRCA1       4.40 0.632       16.3   0.0421      0.843     0.0828  
## 3 BRCA2       9.67 0.932       12.7   0.0621      0.932     0.0828  
## 4 SNCA       45.7  0.00042      0.757 0.915       0.00168   0.915   
## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>

Arranging Data - `dplyr::filter()`

`dplyr::filter()` - Pick rows out of a data set

Consider the following tibble with made up gene information

gene_stats

## # A tibble: 4 × 9
##   gene  test1_stat test1_p test2_stat  test2_p test1_padj test2_padj
##   <chr>      <dbl>   <dbl>      <dbl>    <dbl>      <dbl>      <dbl>
## 1 HOXD1      12.5  0.103       34.2   0.000013    0.206     0.000052
## 2 BRCA1       4.40 0.632       16.3   0.0421      0.843     0.0828  
## 3 BRCA2       9.67 0.932       12.7   0.0621      0.932     0.0828  
## 4 SNCA       45.7  0.00042      0.757 0.915       0.00168   0.915   
## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>

`dplyr::filter()` - Filter based on text values

We can use filter() on the data frame to look for the genes BRCA1 and BRCA2

gene_stats %>% filter(gene == "BRCA1" | gene == "BRCA2")

## # A tibble: 2 × 9
##   gene  test1_stat test1_p test2_stat test2_p test1_padj test2_padj
##   <chr>      <dbl>   <dbl>      <dbl>   <dbl>      <dbl>      <dbl>
## 1 BRCA1       4.40   0.632       16.3  0.0421      0.843     0.0828
## 2 BRCA2       9.67   0.932       12.7  0.0621      0.932     0.0828
## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>

`dplyr::filter()` - Filter based on numeric values

We can use filter() to restrict genes to those that are significant at padj < 0.01

gene_stats %>% filter(test1_padj < 0.01)

## # A tibble: 1 × 9
##   gene  test1_stat test1_p test2_stat test2_p test1_padj test2_padj
##   <chr>      <dbl>   <dbl>      <dbl>   <dbl>      <dbl>      <dbl>
## 1 SNCA        45.7 0.00042      0.757   0.915    0.00168      0.915
## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>

`stringr` - Working with character values

Base R does not have very convenient functions for working with character strings
We must frequently manipulate strings while loading, cleaning, and analyzing datasets
The stringr package aims to make working with strings “as easy as possible.”

`stringr` - Working with character values

Package includes many useful functions for operating on strings:
- searching for patterns
- mutating strings
- lexicographical sorting
- concatenation
- complex search/replace operations
stringr documentation and the very helpful stringr cheatsheet.

`stringr` Example: upper case

The function stringr::str_to_upper() with the dplyr::mutate() function to cast an existing column to upper case

dplyr::mutate(gene_stats,
  gene=stringr::str_to_upper(gene)
)

## # A tibble: 4 × 9
##   gene  test1_stat test1_p test2_stat  test2_p test1_padj test2_padj
##   <chr>      <dbl>   <dbl>      <dbl>    <dbl>      <dbl>      <dbl>
## 1 HOXD1      12.5  0.103       34.2   0.000013    0.206     0.000052
## 2 BRCA1       4.40 0.632       16.3   0.0421      0.843     0.0828  
## 3 BRCA2       9.67 0.932       12.7   0.0621      0.932     0.0828  
## 4 SNCA       45.7  0.00042      0.757 0.915       0.00168   0.915   
## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>

Regular expressions

Many operations in stringr package use regular expression syntax
A regular expression is a “mini” programming language that describes patterns in text
Certain characters have special meaning that help in defining search patterns that identifies the location of sequences of characters in text
Similar to but more powerful than “Find” functionality in many word processors

Regular expression example

Consider the tibble with made up gene information:

gene_stats

## # A tibble: 4 × 9
##   gene  test1_stat test1_p test2_stat  test2_p test1_padj test2_padj
##   <chr>      <dbl>   <dbl>      <dbl>    <dbl>      <dbl>      <dbl>
## 1 HOXD1      12.5  0.103       34.2   0.000013    0.206     0.000052
## 2 BRCA1       4.40 0.632       16.3   0.0421      0.843     0.0828  
## 3 BRCA2       9.67 0.932       12.7   0.0621      0.932     0.0828  
## 4 SNCA       45.7  0.00042      0.757 0.915       0.00168   0.915   
## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>

Used filter() to look for literal “BRCA1” and “BRCA2”
Names follow a pattern of BRCAX, where X is 1 or 2

Regular expression example

stringr::str_detect() returns TRUE if the provided pattern matches the input and FALSE otherwise

stringr::str_detect(c("HOX1A","BRCA1","BRCA2","SNCA"), "^BRCA[12]$")

## [1] FALSE  TRUE  TRUE FALSE

dplyr::filter(gene_stats, str_detect(gene,"^BRCA[12]$"))

## # A tibble: 2 × 9
##   gene  test1_stat test1_p test2_stat test2_p test1_padj test2_padj
##   <chr>      <dbl>   <dbl>      <dbl>   <dbl>      <dbl>      <dbl>
## 1 BRCA1       4.40   0.632       16.3  0.0421      0.843     0.0828
## 2 BRCA2       9.67   0.932       12.7  0.0621      0.932     0.0828
## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>

Regular expression example

dplyr::filter(gene_stats, str_detect(gene,"^BRCA[12]$"))

The argument "^BRCA[12]$" is a regular expression that searches for the following:
- Gene name starts with BRCA (^BRCA)
- Of those, include genes only if followed by either 1 or 2 ([12])
- Of those, match successfully if the number is last character ($)

Regular expression syntax

Regular expression syntax has certain characters with special meaning:
- . - match any single character
- * - match zero or more of the character immediately preceding the * "", "1", "11", "111", ...
- + - match one or more of the character immediately preceding the * "1", "11", "111", ...
- [XYZ] - match one of any of the characters between [] (e.g. X, Y, or Z)
- ^, $ - match the beginning and end of the string, respectively
There are more special characters as well, good tutorial: RegexOne - regular expression tutorial

Arranging Data - `dplyr::select()`

`dplyr::select()` - Subset Columns by Name

dplyr::select() function picks columns out of a tibble:

stats <- dplyr::select(gene_stats, test1_stat, test2_stat)
stats

## # A tibble: 4 × 2
##   test1_stat test2_stat
##        <dbl>      <dbl>
## 1      12.5      34.2  
## 2       4.40     16.3  
## 3       9.67     12.7  
## 4      45.7       0.757

`dplyr::select()` - Helper functions

dplyr also has “helper functions” for more flexible selection of columns
For example, if all of the columns we wished to select ended with _stat, we could use the ends_with() helper function:

stats <- dplyr::select(gene_stats, ends_with("_stat"))
stats

## # A tibble: 4 × 2
##   test1_stat test2_stat
##        <dbl>      <dbl>
## 1      12.5      34.2  
## 2       4.40     16.3  
## 3       9.67     12.7  
## 4      45.7       0.757

`dplyr::select()` - Renaming columns

select() allows for the renaming of selected columns:

stats <- dplyr::select(gene_stats,
  t=test1_stat,
  chisq=test2_stat
)
stats

## # A tibble: 4 × 2
##       t  chisq
##   <dbl>  <dbl>
## 1 12.5  34.2  
## 2  4.40 16.3  
## 3  9.67 12.7  
## 4 45.7   0.757

`dplyr::select()` - Reorder columns

dplyr::select(gene_stats,
    gene,
    test1_stat, test1_p, test1_padj,
    test2_stat, test2_p, test2_padj,
    signif_either,
    signif_both
)

## # A tibble: 4 × 9
##   gene  test1_stat test1_p test1_padj test2_stat  test2_p test2_padj
##   <chr>      <dbl>   <dbl>      <dbl>      <dbl>    <dbl>      <dbl>
## 1 HOXD1      12.5  0.103      0.206       34.2   0.000013   0.000052
## 2 BRCA1       4.40 0.632      0.843       16.3   0.0421     0.0828  
## 3 BRCA2       9.67 0.932      0.932       12.7   0.0621     0.0828  
## 4 SNCA       45.7  0.00042    0.00168      0.757 0.915      0.915   
## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>

Arranging Data - `dplyr::arrange()`

`dplyr::arrange()` - Order rows based on their values

Sometimes want to reorder rows, e.g. by p-value

dplyr::arrange(gene_stats, test1_p)

## # A tibble: 4 × 9
##   gene  test1_stat test1_p test2_stat  test2_p test1_padj test2_padj
##   <chr>      <dbl>   <dbl>      <dbl>    <dbl>      <dbl>      <dbl>
## 1 SNCA       45.7  0.00042      0.757 0.915       0.00168   0.915   
## 2 HOXD1      12.5  0.103       34.2   0.000013    0.206     0.000052
## 3 BRCA1       4.40 0.632       16.3   0.0421      0.843     0.0828  
## 4 BRCA2       9.67 0.932       12.7   0.0621      0.932     0.0828  
## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>

`dplyr::arrange()` - Changing sort direction

dplyr::arrange() sort ascending by default
Can change order to descending with desc() helper function

# desc() sorts descending
dplyr::arrange(gene_stats, desc(abs(test1_stat)))

## # A tibble: 4 × 9
##   gene  test1_stat test1_p test2_stat  test2_p test1_padj test2_padj
##   <chr>      <dbl>   <dbl>      <dbl>    <dbl>      <dbl>      <dbl>
## 1 SNCA       45.7  0.00042      0.757 0.915       0.00168   0.915   
## 2 HOXD1      12.5  0.103       34.2   0.000013    0.206     0.000052
## 3 BRCA2       9.67 0.932       12.7   0.0621      0.932     0.0828  
## 4 BRCA1       4.40 0.632       16.3   0.0421      0.843     0.0828  
## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>

Putting it all together

In the previous sections, we performed the following operations:

Created new FDR on the nominal p-values using the dplyr::mutate() and p.adjust functions
Created new boolean significance for each gene using dplyr::mutate()
Mutated the gene symbol case using stringr::str_to_upper and dplyr::mutate()
Reordered the columns to group related variables with select()
Filtered genes by FDR < 0.05 for either and both statistical tests using dplyr::filter()
Sorted the results by p-value using dplyr::arrange()

Putting it all together

gene_stats <- dplyr::mutate(gene_stats,
  test1_padj=p.adjust(test1_p,method="fdr"),
  test2_padj=p.adjust(test2_p,method="fdr"),
  signif_either=(test1_padj < 0.05 | test2_padj < 0.05),
  signif_both=(test1_padj < 0.05 & test2_padj < 0.05),
  gene=stringr::str_to_upper(gene)
) %>%
dplyr::select(
    gene,
    test1_stat, test1_p, test1_padj,
    test2_stat, test2_p, test2_padj,
    signif_either,
    signif_both
) %>%
dplyr::filter( test1_padj < 0.05 | test2_padj < 0.05 ) %>%
dplyr::arrange( test1_p )

Grouping Data

Grouping data together allow us to summarize them

Consider:

metadata <- tribble(
    ~ID, ~condition, ~age_at_death, ~Braak_stage, ~APOE_genotype,
  "A01",        "AD",            78,            5,       "e4/e4",
  "A02",        "AD",            81,            6,       "e3/e4",
  "A03",        "AD",            90,            5,       "e4/e4",
  "A04",   "Control",            80,            1,       "e3/e4",
  "A05",   "Control",            79,            0,       "e3/e3",
  "A06",   "Control",            81,            0,       "e2/e3"
)

Summarizing groups of data

metadata

## # A tibble: 6 × 5
##   ID    condition age_at_death Braak_stage APOE_genotype
##   <chr> <chr>            <dbl>       <dbl> <chr>        
## 1 A01   AD                  78           5 e4/e4        
## 2 A02   AD                  81           6 e3/e4        
## 3 A03   AD                  90           5 e4/e4        
## 4 A04   Control             80           1 e3/e4        
## 5 A05   Control             79           0 e3/e3        
## 6 A06   Control             81           0 e2/e3

How many samples of each condition are there?
Are the age at death/Braak stage distributions similar?
Must group samples together based on condition to answer these kinds of questions

Grouping with `dplyr::group_by()`

dplyr::group_by() groups rows together based on condition

dplyr::group_by(metadata,
  condition
)

## # A tibble: 6 × 5
## # Groups:   condition [2]
##   ID    condition age_at_death Braak_stage APOE_genotype
##   <chr> <chr>            <dbl>       <dbl> <chr>        
## 1 A01   AD                  78           5 e4/e4        
## 2 A02   AD                  81           6 e3/e4        
## 3 A03   AD                  90           5 e4/e4        
## 4 A04   Control             80           1 e3/e4        
## 5 A05   Control             79           0 e3/e3        
## 6 A06   Control             81           0 e2/e3

Summarizing with `dplyr::summarize()`

dplyr::summarize() (or summarise()) to compute the mean age at death for each group:

dplyr::group_by(metadata,
  condition
) %>% dplyr::summarize(mean_age_at_death = mean(age_at_death))

## # A tibble: 2 × 2
##   condition mean_age_at_death
##   <chr>                 <dbl>
## 1 AD                       83
## 2 Control                  80

Summarizing more

dplyr::group_by(metadata,
  condition
) %>% dplyr::summarize(
  mean_age_at_death = mean(age_at_death),
  sd_age_at_death = sd(age_at_death),
  lower_age = mean_age_at_death-sd_age_at_death,
  upper_age = mean_age_at_death+sd_age_at_death,
)

## # A tibble: 2 × 5
##   condition mean_age_at_death sd_age_at_death lower_age upper_age
##   <chr>                 <dbl>           <dbl>     <dbl>     <dbl>
## 1 AD                       83            6.24      76.8      89.2
## 2 Control                  80            1         79        81

Helper functions with `summarize()`

dplyr::summarize() has some helper functions
n() provides the number of rows each group has

dplyr::group_by(metadata,
  condition
) %>% dplyr::summarize(
  num_subjects = n(),
  mean_age_at_death = mean(age_at_death),
)

## # A tibble: 2 × 3
##   condition num_subjects mean_age_at_death
##   <chr>            <int>             <dbl>
## 1 AD                   3                83
## 2 Control              3                80

Rearranging Data

Sometimes the shape and format of our data doesn’t let us summarize easily

Consider our previous example:

gene_stats <- tribble(
    ~gene, ~test1_stat, ~test1_p, ~test2_stat, ~test2_p,
   "APOE",   12.509293,   0.1032,   34.239521,   1.3e-5,
  "HOXD1",    4.399211,   0.6323,   16.332318,   0.0421,
   "SNCA",   45.748431,   4.2e-9,    0.757188,   0.9146,
)

Rearranging Data

What are the mean and range of our statistic columns?

Could do this manually like so:

tribble(
   ~test_name, ~min, ~mean, ~max,
   "test1_stat", min(gene_stats$test1_stat),
       mean(gene_stats$test1_stat), max(gene_stats$test1_stat),
   "test2_stat",  min(gene_stats$test2_stat),
       mean(gene_stats$test2_stat), max(gene_stats$test2_stat),
)

## # A tibble: 2 × 4
##   test_name    min  mean   max
##   <chr>      <dbl> <dbl> <dbl>
## 1 test1_stat 4.40   20.9  45.7
## 2 test2_stat 0.757  17.1  34.2

Summarizing by column

In previous example, we were summarizing each statistic column
group_by() and summarize() summarize rows!
We can pivot the table so columns become rows with tidyr::pivot_longer()

`tidyr::pivot_longer()`

tidyr::pivot_longer(
  gene_stats,
  c(test1_stat, test2_stat), # columns to pivot
  names_to="test",
  values_to="stat"
)

## # A tibble: 6 × 5
##   gene       test1_p  test2_p test         stat
##   <chr>        <dbl>    <dbl> <chr>       <dbl>
## 1 APOE  0.103        0.000013 test1_stat 12.5  
## 2 APOE  0.103        0.000013 test2_stat 34.2  
## 3 HOXD1 0.632        0.0421   test1_stat  4.40 
## 4 HOXD1 0.632        0.0421   test2_stat 16.3  
## 5 SNCA  0.0000000042 0.915    test1_stat 45.7  
## 6 SNCA  0.0000000042 0.915    test2_stat  0.757

Pivot longer illustration

`tidyr::pivot_longer()`

long_gene_stats <- tidyr::pivot_longer(
  gene_stats,
  ends_with("_stat"), # was c(test1_stat, test2_stat),
  names_to="test",
  values_to="stat"
)

## # A tibble: 6 × 5
##   gene       test1_p  test2_p test         stat
##   <chr>        <dbl>    <dbl> <chr>       <dbl>
## 1 APOE  0.103        0.000013 test1_stat 12.5  
## 2 APOE  0.103        0.000013 test2_stat 34.2  
## 3 HOXD1 0.632        0.0421   test1_stat  4.40 
## 4 HOXD1 0.632        0.0421   test2_stat 16.3  
## 5 SNCA  0.0000000042 0.915    test1_stat 45.7  
## 6 SNCA  0.0000000042 0.915    test2_stat  0.757

Pivot, Group, Summarize

To summarize, group_by() on the test column and summarize() on the stat column:

long_gene_stats %>%
  dplyr::group_by(test) %>%
  dplyr::summarize(
    min = min(stat), mean = mean(stat), max = max(stat)
  )

## # A tibble: 2 × 4
##   test         min  mean   max
##   <chr>      <dbl> <dbl> <dbl>
## 1 test1_stat 4.40   20.9  45.7
## 2 test2_stat 0.757  17.1  34.2

Note: `tidyr::pivot_wider()`

Inverse of pivot_longer() is pivot_wider()
If you have variables gathered in single columns like that produced by pivot_longer(), reverses process to create tibble with those variables as columns
Can “reorganize” tibble by first pivot_longer() followed by pivot_wider()

Pivot wider illustration

# Relational Data

Relational Data

Often need to combine different sources of data together to aid in interpretation of results
Consider our fake gene statistics tibble:

tribble(
    ~gene, ~test1_stat, ~test1_p, ~test2_stat, ~test2_p,
   "APOE",   12.509293,   0.1032,   34.239521,   1.3e-5,
  "HOXD1",    4.399211,   0.6323,   16.332318,   0.0421,
   "SNCA",   45.748431,   4.2e-9,    0.757188,   0.9146,
)

## # A tibble: 3 × 5
##   gene  test1_stat      test1_p test2_stat  test2_p
##   <chr>      <dbl>        <dbl>      <dbl>    <dbl>
## 1 APOE       12.5  0.103            34.2   0.000013
## 2 HOXD1       4.40 0.632            16.3   0.0421  
## 3 SNCA       45.7  0.0000000042      0.757 0.915

Match rows between tibbles

gene column in tibble is a gene symbol
Other gene IDs exist for the same gene, e.g. Ensembl Gene IDs, that are associated with other information

Need to match up gene symbol with corresponding Ensembl Gene ID using a known mapping:

gene_map <- tribble(
    ~symbol,            ~ENSGID,                    ~gene_name,
     "APOE",  "ENSG00000130203",            "apolipoprotein E",
    "BRCA1",  "ENSG00000012048", "BRCA1 DNA repair associated",
    "HOXD1",  "ENSG00000128645",                 "homeobox D1",
     "SNCA",  "ENSG00000145335",             "synuclein alpha",
)

Match rows between tibbles

## # A tibble: 4 × 3
##   symbol ENSGID          gene_name                  
##   <chr>  <chr>           <chr>                      
## 1 APOE   ENSG00000130203 apolipoprotein E           
## 2 BRCA1  ENSG00000012048 BRCA1 DNA repair associated
## 3 HOXD1  ENSG00000128645 homeobox D1                
## 4 SNCA   ENSG00000145335 synuclein alpha

Match rows between tibbles

Term for matching up rows of two tibbles (tables) called joining
We join the gene_stats tibble with the gene_map tibble
Match the gene column with the symbol column
dplyr::left_join() function accepts two data frames and the names of columns in each that share common values:
```
dplyr::left_join(
    x=gene_stats,
    y=gene_map,
    by=c("gene" = "symbol")
)
```

Match rows between tibbles

## # A tibble: 3 × 7
##   gene  test1_stat      test1_p test2_stat  test2_p ENSGID          gene_name   
##   <chr>      <dbl>        <dbl>      <dbl>    <dbl> <chr>           <chr>       
## 1 APOE       12.5  0.103            34.2   0.000013 ENSG00000130203 apolipoprot…
## 2 HOXD1       4.40 0.632            16.3   0.0421   ENSG00000128645 homeobox D1 
## 3 SNCA       45.7  0.0000000042      0.757 0.915    ENSG00000145335 synuclein a…

Match rows between tibbles

Same thing with pipe:

gene_stats %>% dplyr::left_join(
  gene_map,
  by=c("gene" = "symbol")
)

## # A tibble: 3 × 7
##   gene  test1_stat      test1_p test2_stat  test2_p ENSGID          gene_name   
##   <chr>      <dbl>        <dbl>      <dbl>    <dbl> <chr>           <chr>       
## 1 APOE       12.5  0.103            34.2   0.000013 ENSG00000130203 apolipoprot…
## 2 HOXD1       4.40 0.632            16.3   0.0421   ENSG00000128645 homeobox D1 
## 3 SNCA       45.7  0.0000000042      0.757 0.915    ENSG00000145335 synuclein a…

Missing values in join

The “direction” of the join determines which values appear in output
Left join means include all rows in “left” tibble even if there is no match

Note BRCA1 is in gene_map but not gene_stats:

dplyr::left_join(
  gene_map, # gene_map is "left"
  gene_stats, # gene_stats is "right"
  by=c("symbol" = "gene")
)

Missing values in join

## # A tibble: 4 × 7
##   symbol ENSGID          gene_name       test1_stat  test1_p test2_stat  test2_p
##   <chr>  <chr>           <chr>                <dbl>    <dbl>      <dbl>    <dbl>
## 1 APOE   ENSG00000130203 apolipoprotein…      12.5   1.03e-1     34.2    1.3 e-5
## 2 BRCA1  ENSG00000012048 BRCA1 DNA repa…      NA    NA           NA     NA      
## 3 HOXD1  ENSG00000128645 homeobox D1           4.40  6.32e-1     16.3    4.21e-2
## 4 SNCA   ENSG00000145335 synuclein alpha      45.7   4.20e-9      0.757  9.15e-1

Join directions

“Right joins” are simply the opposite of left joins:

dplyr::right_join(
  gene_stats, # gene_stats is "left"
gene_map, # gene_map is "right"
  by=c("gene" = "symbol")
)

## # A tibble: 4 × 7
##   gene  test1_stat       test1_p test2_stat   test2_p ENSGID          gene_name 
##   <chr>      <dbl>         <dbl>      <dbl>     <dbl> <chr>           <chr>     
## 1 APOE       12.5   0.103            34.2    0.000013 ENSG00000130203 apolipopr…
## 2 HOXD1       4.40  0.632            16.3    0.0421   ENSG00000128645 homeobox …
## 3 SNCA       45.7   0.0000000042      0.757  0.915    ENSG00000145335 synuclein…
## 4 BRCA1      NA    NA                NA     NA        ENSG00000012048 BRCA1 DNA…

Inner join

Inner joins return results only for rows that have a match between the two tibbles
Recall gene_map included BRCA1 even though it was not found in gene_stats
Inner join will not include this row, because no match in gene_stats was found:
```
dplyr::inner_join(
  gene_map,
  gene_stats,
  by=c("symbol" = "gene")
)
```

Inner join

## # A tibble: 3 × 7
##   symbol ENSGID          gene_name        test1_stat  test1_p test2_stat test2_p
##   <chr>  <chr>           <chr>                 <dbl>    <dbl>      <dbl>   <dbl>
## 1 APOE   ENSG00000130203 apolipoprotein E      12.5   1.03e-1     34.2   1.3 e-5
## 2 HOXD1  ENSG00000128645 homeobox D1            4.40  6.32e-1     16.3   4.21e-2
## 3 SNCA   ENSG00000145335 synuclein alpha       45.7   4.20e-9      0.757 9.15e-1

Full or outer join

dplyr::full_join() (also sometimes called an outer join)
a full join will return all rows from both tibbles whether a match in the other table was found or not

Multiple matches

Sometimes one table will have multiple rows that match the values in the other
Read the textbook chapter on how to handle

The Tidyverse

Tidyverse Basics

Default tidyverse packages

Importing Data

Note readr functions operate on zipped files

Writing tabular files

The tibble

The tibble is a data frame

tibble column names

Changing tibble column names

tibbles, data frames, and row names

tribble() - Constructing tibbles by hand

Tidy Data

Example of tidy data

Tidy data illustration

pipes with %>%

pipes

Base R serial manipulations

pipes with %>%

Arranging Data

Arranging Data - dplyr::mutate()

dplyr::mutate() - Create new columns using other columns

Made up gene statistics example

dplyr::mutate() - FDR example

dplyr::mutate() - FDR example

dplyr::mutate() - Deriving from multiple columns

dplyr::mutate() - using derived columns

dplyr::mutate() - Modifying columns

Arranging Data - dplyr::filter()

dplyr::filter() - Pick rows out of a data set

dplyr::filter() - Filter based on text values

dplyr::filter() - Filter based on numeric values

stringr - Working with character values

stringr - Working with character values

stringr - Working with character values

stringr Example: upper case

Regular expressions

Regular expression example

Regular expression example

Regular expression example

Regular expression syntax

Arranging Data - dplyr::select()

dplyr::select() - Subset Columns by Name

dplyr::select() - Helper functions

dplyr::select() - Renaming columns

dplyr::select() - Reorder columns

Arranging Data - dplyr::arrange()

dplyr::arrange() - Order rows based on their values

dplyr::arrange() - Changing sort direction

Putting it all together

Putting it all together

Grouping Data

Grouping Data

Summarizing groups of data

Grouping with dplyr::group_by()

Summarizing with dplyr::summarize()

Summarizing more

Helper functions with summarize()

Rearranging Data

Rearranging Data

Rearranging Data

Summarizing by column

tidyr::pivot_longer()

Pivot longer illustration

tidyr::pivot_longer()

Pivot, Group, Summarize

Note: tidyr::pivot_wider()

Pivot wider illustration

Relational Data

Match rows between tibbles

Match rows between tibbles

Match rows between tibbles

Match rows between tibbles

Match rows between tibbles

Missing values in join

Missing values in join

Join directions

Inner join

Inner join

Full or outer join

Note `readr` functions operate on zipped files

`tribble()` - Constructing tibbles by hand

pipes with `%>%`

pipes with `%>%`

Arranging Data - `dplyr::mutate()`

`dplyr::mutate()` - Create new columns using other columns

`dplyr::mutate()` - FDR example

`dplyr::mutate()` - FDR example

`dplyr::mutate()` - Deriving from multiple columns

`dplyr::mutate()` - using derived columns

`dplyr::mutate()` - Modifying columns

Arranging Data - `dplyr::filter()`

`dplyr::filter()` - Pick rows out of a data set

`dplyr::filter()` - Filter based on text values

`dplyr::filter()` - Filter based on numeric values

`stringr` - Working with character values

`stringr` - Working with character values

`stringr` - Working with character values

`stringr` Example: upper case

Arranging Data - `dplyr::select()`

`dplyr::select()` - Subset Columns by Name

`dplyr::select()` - Helper functions

`dplyr::select()` - Renaming columns

`dplyr::select()` - Reorder columns

Arranging Data - `dplyr::arrange()`

`dplyr::arrange()` - Order rows based on their values

`dplyr::arrange()` - Changing sort direction

Grouping with `dplyr::group_by()`

Summarizing with `dplyr::summarize()`

Helper functions with `summarize()`

`tidyr::pivot_longer()`

`tidyr::pivot_longer()`

Note: `tidyr::pivot_wider()`