add <- function(x,y) { return(x+y) }
Can test this manually:
result <- add(1,2) result == 3 [1] TRUE
add <- function(x,y) { return(x+y+1) }
Test FAILS:
result <- add(1,2) result == 3 [1] FALSE
testthat
Packageinstall.packages("testthat")
testthat
Unit Testlibrary(testthat) test_that("add() correctly adds things", { expect_equal(add(1,2), 3) expect_equal(add(5,6), 11) } ) Test passed
test_that
Anatomytest_that("add() correctly adds things", { expect_equal(add(1,2), 3) expect_equal(add(5,6), 11) } )
add() correctly adds things
){}
written using expect_X
functions from the testthat packagetest_that
Failurestest_that("add() correctly adds things", { expect_equal(add(1,2), 3) expect_equal(add(5,6), 10) # incorrect expected value } ) -- Failure (Line 3): add() correctly adds things ------------------------------- add(5, 6) not equal to 10. 1/1 mismatches [1] 11 - 10 == 1 Error: Test failed
test_functions.R
test_file()
add <- function(x,y) { return(x+y) } testthat::test_file("test_functions.R") == Testing test_functions.R ======================================================= [ FAIL 0 | WARN 0 | SKIP 0 | PASS 2 ] Done!
testthat
tests calling functions that are not defined will fail like tests that find incorrect outputtest_that("mul() correctly multiplies things",{ expect_equal(mul(1,2), 2) }) -- Error (Line 1): new function ------------------------------------------------ Error in `mul(1, 2)`: could not find function "mul" Backtrace: 1. testthat::expect_equal(mul(1, 2), 2) 2. testthat::quasi_label(enquo(object), label, arg = "object") 3. rlang::eval_bare(expr, quo_get_env(quo)) Error: Test failed
Each assignment has similar format and workflow, files:
├── reference_report.html ├── main.R ├── README.md ├── report.Rmd └── test_main.R
tidyverse is a set of packages that work together
Load the most common tidyverse packages at the same time:
library(tidyverse) -- Attaching packages ---------- tidyverse 1.3.1 -- v ggplot2 3.3.5 v purrr 0.3.4 v tibble 3.1.6 v dplyr 1.0.7 v tidyr 1.1.4 v stringr 1.4.0 v readr 2.1.1 v forcats 0.5.1 -- Conflicts ------------- tidyverse_conflicts() -- x dplyr::filter() masks stats::filter() x dplyr::lag() masks stats::lag()
Package | Description |
---|---|
ggplot2 | Plotting using the grammar of graphics |
tibble | Simple and sane data frames |
tidyr | Operations for making data “tidy” |
readr | Read rectangular text data into tidyverse |
purrr | Functional programming tools for tidyverse |
dplyr | “A Grammar of Data Manipulation” |
stringr | Makes working with strings in R easier |
forcats | Operations for using categorical variables |
Function | Brief description/use |
---|---|
read_csv |
Delimiter: , - Decimal separator: . |
read_csv2 |
Delimiter: ; - Decimal separator: , |
read_tsv |
Delimiter: <tab> - Decimal separator: . |
read_delim |
Delimiter: set by user - Decimal separator: . |
readr
functions operate on zipped filesreadr
file reading functions can read compressed files directly, so you do not need to decompress them firsreadr
also has functions for writing delimited files:Function | Brief description/use |
---|---|
write_csv |
Delimiter: , - Decimal separator: . |
write_csv2 |
Delimiter: ; - Decimal separator: , |
write_tsv |
Delimiter: <tab> - Decimal separator: . |
write_delim |
Delimiter: set by user - Decimal separator: . |
csv
- character separated value file.csv
id,somevalue,category,genes 1,2.123,A,APOE 4,5.123,B,"HOXA1,HOXB1" 7,8.123,,SNCA
,
,
), the value is wrapped in double quotes,,
or one ,
at the end of the line, if the last column value is missing)tibble
library(tibble) tbl <- tibble( x = rnorm(100, mean=20, sd=10), y = rnorm(100, mean=50, sd=5) ) tbl # A tibble: 100 x 2 x y1 16.5 54.6 2 14.4 54.3 # ... with 98 more rows
tibble
stores rectangular data that can be accessed like a data frame:tbl$x [1] 29.57249 12.01577 15.25536 23.07761 32.25403 48.04651 21.90576 [8] 15.51168 34.87285 21.32433 12.51230 23.60896 6.77630 12.34223 ... tbl[1,"x"] # access the first element of x # A tibble: 1 x 1 x1 29.6 tbl$x[1] [1] 29.57255
tibbles
(and regular data frames) typically have names for their columns accessed with the colnames
functiontbl <- tibble( x = rnorm(100, mean=20, sd=10), y = rnorm(100, mean=50, sd=5) ) colnames(tbl) [1] "x" "y"
colnames(tbl) <- c("a","b") tbl # A tibble: 100 x 2 a b1 16.5 54.6 2 14.4 54.3 # ... with 90 more rows
dplyr::rename
to rename columns as well:dplyr::rename(tbl, a = x, b = y )
tbl <- tibble( x = rnorm(100, mean=20, sd=10), y = rnorm(100, mean=50, sd=5) ) rownames(tbl) [1] "1" "2" "3"...
tibble
support for row names is only included for compatibility with base R data framestribble()
- Constructing tibbles by handtribble()
function:gene_stats <- tribble( ~gene, ~test1_stat, ~test1_p, ~test2_stat, ~test2_p, "hoxd1", 12.509293, 0.1032, 34.239521, 1.3e-5, "brca1", 4.399211, 0.6323, 16.332318, 0.0421, "brca2", 9.672011, 0.9323, 12.689219, 0.0621, "snca", 45.748431, 4.2e-4, 0.757188, 0.9146, )
tidyverse packages designed to operate with so-called “tidy data”
the following rules make data tidy:
a variable is a quantity or property that every observation in our dataset has
each observation is a separate instance of those variable
gene_stats
## # A tibble: 4 × 5 ## gene test1_stat test1_p test2_stat test2_p ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 hoxd1 12.5 0.103 34.2 0.000013 ## 2 brca1 4.40 0.632 16.3 0.0421 ## 3 brca2 9.67 0.932 12.7 0.0621 ## 4 snca 45.7 0.00042 0.757 0.915
%>%
In base R we would need to do this with assignments
# data_file.csv has two columns: bad_cOlumn_name and numeric_column data <- readr::read_csv("data_file.csv") data <- dplyr::rename(data, "better_column_name"=bad_cOlumn_name) data <- dplyr::filter(data, better_column_name %in% c("condA","condB")) data_grouped <- dplyr::group_by(data, better_column_name) summarized <- dplyr::summarize(data_grouped, mean(numeric_column))
%>%
A key tidyverse
programming pattern is chaining manipulations of tibble
s together by passing the result of one manipulation as input to the next
Tidyverse defines the %>%
operator to do this:
data <- readr::read_csv("data_file.csv") %>% dplyr::rename("better_column_name"=bad_cOlumn_name) %>% dplyr::filter(better_column_name %in% c("condA","condB")) %>% dplyr::group_by(better_column_name) %>% dplyr::summarize(mean(numeric_column))
The %>%
operator passes the result of the function immediately preceding it as the first argument to the next function automatically
dplyr
packagedplyr::mutate()
dplyr::mutate()
- Create new columns using other columnsp.adjust
function can perform several of these procedures, including FDRgene_stats
## # A tibble: 4 × 5 ## gene test1_stat test1_p test2_stat test2_p ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 hoxd1 12.5 0.103 34.2 0.000013 ## 2 brca1 4.40 0.632 16.3 0.0421 ## 3 brca2 9.67 0.932 12.7 0.0621 ## 4 snca 45.7 0.00042 0.757 0.915
dplyr::mutate()
- FDR examplegene_stats <- dplyr::mutate(gene_stats, test1_padj=p.adjust(test1_p,method="fdr") ) gene_stats
## # A tibble: 4 × 6 ## gene test1_stat test1_p test2_stat test2_p test1_padj ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 hoxd1 12.5 0.103 34.2 0.000013 0.206 ## 2 brca1 4.40 0.632 16.3 0.0421 0.843 ## 3 brca2 9.67 0.932 12.7 0.0621 0.932 ## 4 snca 45.7 0.00042 0.757 0.915 0.00168
dplyr::mutate()
- FDR exampleYou can create multiple columns in the same call to mutate()
:
gene_stats <- dplyr::mutate(gene_stats, test1_padj=p.adjust(test1_p,method="fdr"), test2_padj=p.adjust(test2_p,method="fdr") ) gene_stats
## # A tibble: 4 × 7 ## gene test1_stat test1_p test2_stat test2_p test1_padj test2_padj ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 hoxd1 12.5 0.103 34.2 0.000013 0.206 0.000052 ## 2 brca1 4.40 0.632 16.3 0.0421 0.843 0.0828 ## 3 brca2 9.67 0.932 12.7 0.0621 0.932 0.0828 ## 4 snca 45.7 0.00042 0.757 0.915 0.00168 0.915
dplyr::mutate()
- Deriving from multiple columnsHere we create a column with TRUE
or FALSE
if either or both of the adjusted p-values are less than \(0.05\):
gene_stats <- dplyr::mutate(gene_stats, signif_either=(test1_padj < 0.05 | test2_padj < 0.05), signif_both=(test1_padj < 0.05 & test2_padj < 0.05) )
|
and &
operators execute ‘or’ and ‘and’ logic, respectively
dplyr::mutate()
- using derived columnsmutate()
call can be used in subsequent column definitions:gene_stats <- dplyr::mutate(gene_stats, test1_padj=p.adjust(test1_p,method="fdr"), # test1_padj created test2_padj=p.adjust(test2_p,method="fdr"), signif_either=(test1_padj < 0.05 | test2_padj < 0.05), #test1_padj used signif_both=(test1_padj < 0.05 & test2_padj < 0.05) )
dplyr::mutate()
- Modifying columnsmutate()
can also be used to modify columns in place
Example replaces the values in the gene
column with upper cased values
dplyr::mutate(gene_stats, gene=stringr::str_to_upper(gene) )
## # A tibble: 4 × 9 ## gene test1_stat test1_p test2_stat test2_p test1_padj test2_padj ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 HOXD1 12.5 0.103 34.2 0.000013 0.206 0.000052 ## 2 BRCA1 4.40 0.632 16.3 0.0421 0.843 0.0828 ## 3 BRCA2 9.67 0.932 12.7 0.0621 0.932 0.0828 ## 4 SNCA 45.7 0.00042 0.757 0.915 0.00168 0.915 ## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>
dplyr::filter()
dplyr::filter()
- Pick rows out of a data setgene_stats
## # A tibble: 4 × 9 ## gene test1_stat test1_p test2_stat test2_p test1_padj test2_padj ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 HOXD1 12.5 0.103 34.2 0.000013 0.206 0.000052 ## 2 BRCA1 4.40 0.632 16.3 0.0421 0.843 0.0828 ## 3 BRCA2 9.67 0.932 12.7 0.0621 0.932 0.0828 ## 4 SNCA 45.7 0.00042 0.757 0.915 0.00168 0.915 ## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>
dplyr::filter()
- Filter based on text valuesfilter()
on the data frame to look for the genes BRCA1 and BRCA2gene_stats %>% filter(gene == "BRCA1" | gene == "BRCA2")
## # A tibble: 2 × 9 ## gene test1_stat test1_p test2_stat test2_p test1_padj test2_padj ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 BRCA1 4.40 0.632 16.3 0.0421 0.843 0.0828 ## 2 BRCA2 9.67 0.932 12.7 0.0621 0.932 0.0828 ## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>
dplyr::filter()
- Filter based on numeric valuesfilter()
to restrict genes to those that are significant at padj < 0.01
gene_stats %>% filter(test1_padj < 0.01)
## # A tibble: 1 × 9 ## gene test1_stat test1_p test2_stat test2_p test1_padj test2_padj ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 SNCA 45.7 0.00042 0.757 0.915 0.00168 0.915 ## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>
stringr
- Working with character valuesstringr
- Working with character valuesstringr
- Working with character valuesstringr
Example: upper caseThe function stringr::str_to_upper()
with the dplyr::mutate()
function to cast an existing column to upper case
dplyr::mutate(gene_stats, gene=stringr::str_to_upper(gene) )
## # A tibble: 4 × 9 ## gene test1_stat test1_p test2_stat test2_p test1_padj test2_padj ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 HOXD1 12.5 0.103 34.2 0.000013 0.206 0.000052 ## 2 BRCA1 4.40 0.632 16.3 0.0421 0.843 0.0828 ## 3 BRCA2 9.67 0.932 12.7 0.0621 0.932 0.0828 ## 4 SNCA 45.7 0.00042 0.757 0.915 0.00168 0.915 ## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>
stringr
package use regular expression syntaxgene_stats
## # A tibble: 4 × 9 ## gene test1_stat test1_p test2_stat test2_p test1_padj test2_padj ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 HOXD1 12.5 0.103 34.2 0.000013 0.206 0.000052 ## 2 BRCA1 4.40 0.632 16.3 0.0421 0.843 0.0828 ## 3 BRCA2 9.67 0.932 12.7 0.0621 0.932 0.0828 ## 4 SNCA 45.7 0.00042 0.757 0.915 0.00168 0.915 ## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>
filter()
to look for literal “BRCA1” and “BRCA2”stringr::str_detect()
returns TRUE
if the provided pattern matches the input and FALSE
otherwisestringr::str_detect(c("HOX1A","BRCA1","BRCA2","SNCA"), "^BRCA[12]$")
## [1] FALSE TRUE TRUE FALSE
dplyr::filter(gene_stats, str_detect(gene,"^BRCA[12]$"))
## # A tibble: 2 × 9 ## gene test1_stat test1_p test2_stat test2_p test1_padj test2_padj ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 BRCA1 4.40 0.632 16.3 0.0421 0.843 0.0828 ## 2 BRCA2 9.67 0.932 12.7 0.0621 0.932 0.0828 ## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>
dplyr::filter(gene_stats, str_detect(gene,"^BRCA[12]$"))
The argument "^BRCA[12]$"
is a regular expression that searches for the following:
BRCA
(^BRCA
)1
or 2
([12]
)$
).
- match any single character*
- match zero or more of the character immediately preceding the *
"", "1", "11", "111", ...
+
- match one or more of the character immediately preceding the *
"1", "11", "111", ...
[XYZ]
- match one of any of the characters between []
(e.g. X
, Y
, or Z
)^
, $
- match the beginning and end of the string, respectivelydplyr::select()
dplyr::select()
- Subset Columns by Namedplyr::select()
function picks columns out of a tibble
:stats <- dplyr::select(gene_stats, test1_stat, test2_stat) stats
## # A tibble: 4 × 2 ## test1_stat test2_stat ## <dbl> <dbl> ## 1 12.5 34.2 ## 2 4.40 16.3 ## 3 9.67 12.7 ## 4 45.7 0.757
dplyr::select()
- Helper functionsdplyr
also has “helper functions” for more flexible selection of columns_stat
, we could use the ends_with()
helper function:stats <- dplyr::select(gene_stats, ends_with("_stat")) stats
## # A tibble: 4 × 2 ## test1_stat test2_stat ## <dbl> <dbl> ## 1 12.5 34.2 ## 2 4.40 16.3 ## 3 9.67 12.7 ## 4 45.7 0.757
dplyr::select()
- Renaming columnsselect()
allows for the renaming of selected columns:stats <- dplyr::select(gene_stats, t=test1_stat, chisq=test2_stat ) stats
## # A tibble: 4 × 2 ## t chisq ## <dbl> <dbl> ## 1 12.5 34.2 ## 2 4.40 16.3 ## 3 9.67 12.7 ## 4 45.7 0.757
dplyr::select()
- Reorder columnsdplyr::select(gene_stats, gene, test1_stat, test1_p, test1_padj, test2_stat, test2_p, test2_padj, signif_either, signif_both )
## # A tibble: 4 × 9 ## gene test1_stat test1_p test1_padj test2_stat test2_p test2_padj ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 HOXD1 12.5 0.103 0.206 34.2 0.000013 0.000052 ## 2 BRCA1 4.40 0.632 0.843 16.3 0.0421 0.0828 ## 3 BRCA2 9.67 0.932 0.932 12.7 0.0621 0.0828 ## 4 SNCA 45.7 0.00042 0.00168 0.757 0.915 0.915 ## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>
dplyr::arrange()
dplyr::arrange()
- Order rows based on their valuesdplyr::arrange(gene_stats, test1_p)
## # A tibble: 4 × 9 ## gene test1_stat test1_p test2_stat test2_p test1_padj test2_padj ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 SNCA 45.7 0.00042 0.757 0.915 0.00168 0.915 ## 2 HOXD1 12.5 0.103 34.2 0.000013 0.206 0.000052 ## 3 BRCA1 4.40 0.632 16.3 0.0421 0.843 0.0828 ## 4 BRCA2 9.67 0.932 12.7 0.0621 0.932 0.0828 ## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>
dplyr::arrange()
- Changing sort directiondplyr::arrange()
sort ascending by defaultdesc()
helper function# desc() sorts descending dplyr::arrange(gene_stats, desc(abs(test1_stat)))
## # A tibble: 4 × 9 ## gene test1_stat test1_p test2_stat test2_p test1_padj test2_padj ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 SNCA 45.7 0.00042 0.757 0.915 0.00168 0.915 ## 2 HOXD1 12.5 0.103 34.2 0.000013 0.206 0.000052 ## 3 BRCA2 9.67 0.932 12.7 0.0621 0.932 0.0828 ## 4 BRCA1 4.40 0.632 16.3 0.0421 0.843 0.0828 ## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>
In the previous sections, we performed the following operations:
dplyr::mutate()
and p.adjust
functionsdplyr::mutate()
stringr::str_to_upper
and dplyr::mutate()
select()
dplyr::filter()
dplyr::arrange()
gene_stats <- dplyr::mutate(gene_stats, test1_padj=p.adjust(test1_p,method="fdr"), test2_padj=p.adjust(test2_p,method="fdr"), signif_either=(test1_padj < 0.05 | test2_padj < 0.05), signif_both=(test1_padj < 0.05 & test2_padj < 0.05), gene=stringr::str_to_upper(gene) ) %>% dplyr::select( gene, test1_stat, test1_p, test1_padj, test2_stat, test2_p, test2_padj, signif_either, signif_both ) %>% dplyr::filter( test1_padj < 0.05 | test2_padj < 0.05 ) %>% dplyr::arrange( test1_p )
Grouping data together allow us to summarize them
Consider:
metadata <- tribble( ~ID, ~condition, ~age_at_death, ~Braak_stage, ~APOE_genotype, "A01", "AD", 78, 5, "e4/e4", "A02", "AD", 81, 6, "e3/e4", "A03", "AD", 90, 5, "e4/e4", "A04", "Control", 80, 1, "e3/e4", "A05", "Control", 79, 0, "e3/e3", "A06", "Control", 81, 0, "e2/e3" )
metadata
## # A tibble: 6 × 5 ## ID condition age_at_death Braak_stage APOE_genotype ## <chr> <chr> <dbl> <dbl> <chr> ## 1 A01 AD 78 5 e4/e4 ## 2 A02 AD 81 6 e3/e4 ## 3 A03 AD 90 5 e4/e4 ## 4 A04 Control 80 1 e3/e4 ## 5 A05 Control 79 0 e3/e3 ## 6 A06 Control 81 0 e2/e3
dplyr::group_by()
dplyr::group_by()
groups rows together based on condition
dplyr::group_by(metadata, condition )
## # A tibble: 6 × 5 ## # Groups: condition [2] ## ID condition age_at_death Braak_stage APOE_genotype ## <chr> <chr> <dbl> <dbl> <chr> ## 1 A01 AD 78 5 e4/e4 ## 2 A02 AD 81 6 e3/e4 ## 3 A03 AD 90 5 e4/e4 ## 4 A04 Control 80 1 e3/e4 ## 5 A05 Control 79 0 e3/e3 ## 6 A06 Control 81 0 e2/e3
dplyr::summarize()
dplyr::summarize()
(or summarise()
) to compute the mean age at death for each group:
dplyr::group_by(metadata, condition ) %>% dplyr::summarize(mean_age_at_death = mean(age_at_death))
## # A tibble: 2 × 2 ## condition mean_age_at_death ## <chr> <dbl> ## 1 AD 83 ## 2 Control 80
dplyr::group_by(metadata, condition ) %>% dplyr::summarize( mean_age_at_death = mean(age_at_death), sd_age_at_death = sd(age_at_death), lower_age = mean_age_at_death-sd_age_at_death, upper_age = mean_age_at_death+sd_age_at_death, )
## # A tibble: 2 × 5 ## condition mean_age_at_death sd_age_at_death lower_age upper_age ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 AD 83 6.24 76.8 89.2 ## 2 Control 80 1 79 81
summarize()
dplyr::summarize()
has some helper functionsn()
provides the number of rows each group hasdplyr::group_by(metadata, condition ) %>% dplyr::summarize( num_subjects = n(), mean_age_at_death = mean(age_at_death), )
## # A tibble: 2 × 3 ## condition num_subjects mean_age_at_death ## <chr> <int> <dbl> ## 1 AD 3 83 ## 2 Control 3 80
Sometimes the shape and format of our data doesn’t let us summarize easily
Consider our previous example:
gene_stats <- tribble( ~gene, ~test1_stat, ~test1_p, ~test2_stat, ~test2_p, "APOE", 12.509293, 0.1032, 34.239521, 1.3e-5, "HOXD1", 4.399211, 0.6323, 16.332318, 0.0421, "SNCA", 45.748431, 4.2e-9, 0.757188, 0.9146, )
What are the mean and range of our statistic columns?
Could do this manually like so:
tribble( ~test_name, ~min, ~mean, ~max, "test1_stat", min(gene_stats$test1_stat), mean(gene_stats$test1_stat), max(gene_stats$test1_stat), "test2_stat", min(gene_stats$test2_stat), mean(gene_stats$test2_stat), max(gene_stats$test2_stat), )
## # A tibble: 2 × 4 ## test_name min mean max ## <chr> <dbl> <dbl> <dbl> ## 1 test1_stat 4.40 20.9 45.7 ## 2 test2_stat 0.757 17.1 34.2
group_by()
and summarize()
summarize rows!tidyr::pivot_longer()
tidyr::pivot_longer()
tidyr::pivot_longer( gene_stats, c(test1_stat, test2_stat), # columns to pivot names_to="test", values_to="stat" )
## # A tibble: 6 × 5 ## gene test1_p test2_p test stat ## <chr> <dbl> <dbl> <chr> <dbl> ## 1 APOE 0.103 0.000013 test1_stat 12.5 ## 2 APOE 0.103 0.000013 test2_stat 34.2 ## 3 HOXD1 0.632 0.0421 test1_stat 4.40 ## 4 HOXD1 0.632 0.0421 test2_stat 16.3 ## 5 SNCA 0.0000000042 0.915 test1_stat 45.7 ## 6 SNCA 0.0000000042 0.915 test2_stat 0.757
tidyr::pivot_longer()
long_gene_stats <- tidyr::pivot_longer( gene_stats, ends_with("_stat"), # was c(test1_stat, test2_stat), names_to="test", values_to="stat" )
## # A tibble: 6 × 5 ## gene test1_p test2_p test stat ## <chr> <dbl> <dbl> <chr> <dbl> ## 1 APOE 0.103 0.000013 test1_stat 12.5 ## 2 APOE 0.103 0.000013 test2_stat 34.2 ## 3 HOXD1 0.632 0.0421 test1_stat 4.40 ## 4 HOXD1 0.632 0.0421 test2_stat 16.3 ## 5 SNCA 0.0000000042 0.915 test1_stat 45.7 ## 6 SNCA 0.0000000042 0.915 test2_stat 0.757
group_by()
on the test
column and summarize()
on the stat
column:long_gene_stats %>% dplyr::group_by(test) %>% dplyr::summarize( min = min(stat), mean = mean(stat), max = max(stat) )
## # A tibble: 2 × 4 ## test min mean max ## <chr> <dbl> <dbl> <dbl> ## 1 test1_stat 4.40 20.9 45.7 ## 2 test2_stat 0.757 17.1 34.2
tidyr::pivot_wider()
pivot_longer()
is pivot_wider()
pivot_longer()
, reverses process to create tibble with those variables as columnspivot_longer()
followed by pivot_wider()
intensities <- readr::read_csv("example_intensity_data.csv") intensities # A tibble: 54,675 x 36 probe GSM972389 GSM972390 GSM972396 GSM972401 GSM972409 <chr> <dbl> <dbl> <dbl> <dbl> <dbl> 1 1007_s_at 9.54 10.2 9.72 9.68 9.35 2 1053_at 7.62 7.92 7.17 7.24 8.20 3 117_at 5.50 5.56 5.06 7.44 5.19 4 121_at 7.27 7.96 7.42 7.34 7.49 5 1255_g_at 2.79 3.10 2.78 2.91 3.02 # ... with 54,665 more rows, and 26 more variables: GSM972433 <dbl> # GSM972487 <dbl>, GSM972488 <dbl>, GSM972489 <dbl>, # GSM972510 <dbl>, GSM972512 <dbl>, GSM972521 <dbl>
“tidy” data has the following properties:
Data from high throughput biological experiments have many more variables than observations!
Biological data matrices are usually transposed
# A tibble: 54,675 x 36 probe GSM972389 GSM972390 GSM972396 GSM972401 GSM972409 <chr> <dbl> <dbl> <dbl> <dbl> <dbl> 1 1007_s_at 9.54 10.2 9.72 9.68 9.35 2 1053_at 7.62 7.92 7.17 7.24 8.20
Base R and tidyverse are optimized to perform computations on columns not rows
Can perform operations on rows rather than columns, but code may perform poorly
A couple options:
Pivot into long format.
Compute row-wise statistics using apply()
.
intensity_variance <- apply(intensities, 2, var) intensities$variance <- intensity_variance
Bioconductor is itself a package called BiocManager
BiocManager
must be installed prior to installing other Bioconductor packages
To [install bioconductor] (note install.packages()
):
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install(version = "3.16")
# installs the affy bioconductor package for microarray analysis BiocManager::install("affy")
In addition, Biconductor provides three types of documentation:
SummarizedExperiment
class stores data and metadata for an experimentSummarizedExperiment
IllustrationSummarizedExperiment
DetailsSummarizedExperiment
class is used ubiquitously throughout the Bioconductor package ecosystemSummarizedExperiment
stores:
assays
)colData
and exptData
)rowData
)