add <- function(x,y) {
return(x+y)
}
Can test this manually:
result <- add(1,2) result == 3 [1] TRUE
add <- function(x,y) {
return(x+y+1)
}
Test FAILS:
result <- add(1,2) result == 3 [1] FALSE
testthat Packageinstall.packages("testthat")testthat Unit Testlibrary(testthat)
test_that("add() correctly adds things", {
expect_equal(add(1,2), 3)
expect_equal(add(5,6), 11)
}
)
Test passed
test_that Anatomytest_that("add() correctly adds things", {
expect_equal(add(1,2), 3)
expect_equal(add(5,6), 11)
}
)
add() correctly adds things){} written using expect_X functions from the testthat packagetest_that Failurestest_that("add() correctly adds things", {
expect_equal(add(1,2), 3)
expect_equal(add(5,6), 10) # incorrect expected value
}
)
-- Failure (Line 3): add() correctly adds things -------------------------------
add(5, 6) not equal to 10.
1/1 mismatches
[1] 11 - 10 == 1
Error: Test failed
test_functions.Rtest_file()add <- function(x,y) {
return(x+y)
}
testthat::test_file("test_functions.R")
== Testing test_functions.R =======================================================
[ FAIL 0 | WARN 0 | SKIP 0 | PASS 2 ] Done!
testthat tests calling functions that are not defined will fail like tests that find incorrect outputtest_that("mul() correctly multiplies things",{
expect_equal(mul(1,2), 2)
})
-- Error (Line 1): new function ------------------------------------------------
Error in `mul(1, 2)`: could not find function "mul"
Backtrace:
1. testthat::expect_equal(mul(1, 2), 2)
2. testthat::quasi_label(enquo(object), label, arg = "object")
3. rlang::eval_bare(expr, quo_get_env(quo))
Error: Test failed
Each assignment has similar format and workflow, files:
├── reference_report.html ├── main.R ├── README.md ├── report.Rmd └── test_main.R
tidyverse is a set of packages that work together
Load the most common tidyverse packages at the same time:
library(tidyverse) -- Attaching packages ---------- tidyverse 1.3.1 -- v ggplot2 3.3.5 v purrr 0.3.4 v tibble 3.1.6 v dplyr 1.0.7 v tidyr 1.1.4 v stringr 1.4.0 v readr 2.1.1 v forcats 0.5.1 -- Conflicts ------------- tidyverse_conflicts() -- x dplyr::filter() masks stats::filter() x dplyr::lag() masks stats::lag()
| Package | Description |
|---|---|
| ggplot2 | Plotting using the grammar of graphics |
| tibble | Simple and sane data frames |
| tidyr | Operations for making data “tidy” |
| readr | Read rectangular text data into tidyverse |
| purrr | Functional programming tools for tidyverse |
| dplyr | “A Grammar of Data Manipulation” |
| stringr | Makes working with strings in R easier |
| forcats | Operations for using categorical variables |
| Function | Brief description/use |
|---|---|
read_csv |
Delimiter: , - Decimal separator: . |
read_csv2 |
Delimiter: ; - Decimal separator: , |
read_tsv |
Delimiter: <tab> - Decimal separator: . |
read_delim |
Delimiter: set by user - Decimal separator: . |
readr functions operate on zipped filesreadr file reading functions can read compressed files directly, so you do not need to decompress them firsreadr also has functions for writing delimited files:| Function | Brief description/use |
|---|---|
write_csv |
Delimiter: , - Decimal separator: . |
write_csv2 |
Delimiter: ; - Decimal separator: , |
write_tsv |
Delimiter: <tab> - Decimal separator: . |
write_delim |
Delimiter: set by user - Decimal separator: . |
csv - character separated value file.csvid,somevalue,category,genes 1,2.123,A,APOE 4,5.123,B,"HOXA1,HOXB1" 7,8.123,,SNCA
,,), the value is wrapped in double quotes,, or one , at the end of the line, if the last column value is missing)tibble
library(tibble)
tbl <- tibble(
x = rnorm(100, mean=20, sd=10),
y = rnorm(100, mean=50, sd=5)
)
tbl
# A tibble: 100 x 2
x y
1 16.5 54.6
2 14.4 54.3
# ... with 98 more rows
tibble stores rectangular data that can be accessed like a data frame:
tbl$x
[1] 29.57249 12.01577 15.25536 23.07761 32.25403 48.04651 21.90576
[8] 15.51168 34.87285 21.32433 12.51230 23.60896 6.77630 12.34223
...
tbl[1,"x"] # access the first element of x
# A tibble: 1 x 1
x
1 29.6
tbl$x[1]
[1] 29.57255
tibbles (and regular data frames) typically have names for their columns accessed with the colnames function
tbl <- tibble(
x = rnorm(100, mean=20, sd=10),
y = rnorm(100, mean=50, sd=5)
)
colnames(tbl)
[1] "x" "y"
colnames(tbl) <- c("a","b")
tbl
# A tibble: 100 x 2
a b
1 16.5 54.6
2 14.4 54.3
# ... with 90 more rows
dplyr::rename to rename columns as well:
dplyr::rename(tbl,
a = x,
b = y
)
tbl <- tibble(
x = rnorm(100, mean=20, sd=10),
y = rnorm(100, mean=50, sd=5)
)
rownames(tbl)
[1] "1" "2" "3"...
tibble support for row names is only included for compatibility with base R data framestribble() - Constructing tibbles by handtribble() function:gene_stats <- tribble(
~gene, ~test1_stat, ~test1_p, ~test2_stat, ~test2_p,
"hoxd1", 12.509293, 0.1032, 34.239521, 1.3e-5,
"brca1", 4.399211, 0.6323, 16.332318, 0.0421,
"brca2", 9.672011, 0.9323, 12.689219, 0.0621,
"snca", 45.748431, 4.2e-4, 0.757188, 0.9146,
)
tidyverse packages designed to operate with so-called “tidy data”
the following rules make data tidy:
a variable is a quantity or property that every observation in our dataset has
each observation is a separate instance of those variable
gene_stats
## # A tibble: 4 × 5 ## gene test1_stat test1_p test2_stat test2_p ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 hoxd1 12.5 0.103 34.2 0.000013 ## 2 brca1 4.40 0.632 16.3 0.0421 ## 3 brca2 9.67 0.932 12.7 0.0621 ## 4 snca 45.7 0.00042 0.757 0.915
%>%In base R we would need to do this with assignments
# data_file.csv has two columns: bad_cOlumn_name and numeric_column
data <- readr::read_csv("data_file.csv")
data <- dplyr::rename(data, "better_column_name"=bad_cOlumn_name)
data <- dplyr::filter(data, better_column_name %in% c("condA","condB"))
data_grouped <- dplyr::group_by(data, better_column_name)
summarized <- dplyr::summarize(data_grouped, mean(numeric_column))%>%A key tidyverse programming pattern is chaining manipulations of tibbles together by passing the result of one manipulation as input to the next
Tidyverse defines the %>% operator to do this:
data <- readr::read_csv("data_file.csv") %>%
dplyr::rename("better_column_name"=bad_cOlumn_name) %>%
dplyr::filter(better_column_name %in% c("condA","condB")) %>%
dplyr::group_by(better_column_name) %>%
dplyr::summarize(mean(numeric_column))The %>% operator passes the result of the function immediately preceding it as the first argument to the next function automatically
dplyr packagedplyr::mutate()dplyr::mutate() - Create new columns using other columnsp.adjust function can perform several of these procedures, including FDRgene_stats
## # A tibble: 4 × 5 ## gene test1_stat test1_p test2_stat test2_p ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 hoxd1 12.5 0.103 34.2 0.000013 ## 2 brca1 4.40 0.632 16.3 0.0421 ## 3 brca2 9.67 0.932 12.7 0.0621 ## 4 snca 45.7 0.00042 0.757 0.915
dplyr::mutate() - FDR examplegene_stats <- dplyr::mutate(gene_stats, test1_padj=p.adjust(test1_p,method="fdr") ) gene_stats
## # A tibble: 4 × 6 ## gene test1_stat test1_p test2_stat test2_p test1_padj ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 hoxd1 12.5 0.103 34.2 0.000013 0.206 ## 2 brca1 4.40 0.632 16.3 0.0421 0.843 ## 3 brca2 9.67 0.932 12.7 0.0621 0.932 ## 4 snca 45.7 0.00042 0.757 0.915 0.00168
dplyr::mutate() - FDR exampleYou can create multiple columns in the same call to mutate():
gene_stats <- dplyr::mutate(gene_stats, test1_padj=p.adjust(test1_p,method="fdr"), test2_padj=p.adjust(test2_p,method="fdr") ) gene_stats
## # A tibble: 4 × 7 ## gene test1_stat test1_p test2_stat test2_p test1_padj test2_padj ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 hoxd1 12.5 0.103 34.2 0.000013 0.206 0.000052 ## 2 brca1 4.40 0.632 16.3 0.0421 0.843 0.0828 ## 3 brca2 9.67 0.932 12.7 0.0621 0.932 0.0828 ## 4 snca 45.7 0.00042 0.757 0.915 0.00168 0.915
dplyr::mutate() - Deriving from multiple columnsHere we create a column with TRUE or FALSE if either or both of the adjusted p-values are less than \(0.05\):
gene_stats <- dplyr::mutate(gene_stats, signif_either=(test1_padj < 0.05 | test2_padj < 0.05), signif_both=(test1_padj < 0.05 & test2_padj < 0.05) )
| and & operators execute ‘or’ and ‘and’ logic, respectively
dplyr::mutate() - using derived columnsmutate() call can be used in subsequent column definitions:gene_stats <- dplyr::mutate(gene_stats, test1_padj=p.adjust(test1_p,method="fdr"), # test1_padj created test2_padj=p.adjust(test2_p,method="fdr"), signif_either=(test1_padj < 0.05 | test2_padj < 0.05), #test1_padj used signif_both=(test1_padj < 0.05 & test2_padj < 0.05) )
dplyr::mutate() - Modifying columnsmutate() can also be used to modify columns in place
Example replaces the values in the gene column with upper cased values
dplyr::mutate(gene_stats, gene=stringr::str_to_upper(gene) )
## # A tibble: 4 × 9 ## gene test1_stat test1_p test2_stat test2_p test1_padj test2_padj ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 HOXD1 12.5 0.103 34.2 0.000013 0.206 0.000052 ## 2 BRCA1 4.40 0.632 16.3 0.0421 0.843 0.0828 ## 3 BRCA2 9.67 0.932 12.7 0.0621 0.932 0.0828 ## 4 SNCA 45.7 0.00042 0.757 0.915 0.00168 0.915 ## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>
dplyr::filter()dplyr::filter() - Pick rows out of a data setgene_stats
## # A tibble: 4 × 9 ## gene test1_stat test1_p test2_stat test2_p test1_padj test2_padj ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 HOXD1 12.5 0.103 34.2 0.000013 0.206 0.000052 ## 2 BRCA1 4.40 0.632 16.3 0.0421 0.843 0.0828 ## 3 BRCA2 9.67 0.932 12.7 0.0621 0.932 0.0828 ## 4 SNCA 45.7 0.00042 0.757 0.915 0.00168 0.915 ## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>
dplyr::filter() - Filter based on text valuesfilter() on the data frame to look for the genes BRCA1 and BRCA2gene_stats %>% filter(gene == "BRCA1" | gene == "BRCA2")
## # A tibble: 2 × 9 ## gene test1_stat test1_p test2_stat test2_p test1_padj test2_padj ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 BRCA1 4.40 0.632 16.3 0.0421 0.843 0.0828 ## 2 BRCA2 9.67 0.932 12.7 0.0621 0.932 0.0828 ## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>
dplyr::filter() - Filter based on numeric valuesfilter() to restrict genes to those that are significant at padj < 0.01gene_stats %>% filter(test1_padj < 0.01)
## # A tibble: 1 × 9 ## gene test1_stat test1_p test2_stat test2_p test1_padj test2_padj ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 SNCA 45.7 0.00042 0.757 0.915 0.00168 0.915 ## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>
stringr - Working with character valuesstringr - Working with character valuesstringr - Working with character valuesstringr Example: upper caseThe function stringr::str_to_upper() with the dplyr::mutate() function to cast an existing column to upper case
dplyr::mutate(gene_stats, gene=stringr::str_to_upper(gene) )
## # A tibble: 4 × 9 ## gene test1_stat test1_p test2_stat test2_p test1_padj test2_padj ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 HOXD1 12.5 0.103 34.2 0.000013 0.206 0.000052 ## 2 BRCA1 4.40 0.632 16.3 0.0421 0.843 0.0828 ## 3 BRCA2 9.67 0.932 12.7 0.0621 0.932 0.0828 ## 4 SNCA 45.7 0.00042 0.757 0.915 0.00168 0.915 ## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>
stringr package use regular expression syntaxgene_stats
## # A tibble: 4 × 9 ## gene test1_stat test1_p test2_stat test2_p test1_padj test2_padj ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 HOXD1 12.5 0.103 34.2 0.000013 0.206 0.000052 ## 2 BRCA1 4.40 0.632 16.3 0.0421 0.843 0.0828 ## 3 BRCA2 9.67 0.932 12.7 0.0621 0.932 0.0828 ## 4 SNCA 45.7 0.00042 0.757 0.915 0.00168 0.915 ## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>
filter() to look for literal “BRCA1” and “BRCA2”stringr::str_detect() returns TRUE if the provided pattern matches the input and FALSE otherwisestringr::str_detect(c("HOX1A","BRCA1","BRCA2","SNCA"), "^BRCA[12]$")
## [1] FALSE TRUE TRUE FALSE
dplyr::filter(gene_stats, str_detect(gene,"^BRCA[12]$"))
## # A tibble: 2 × 9 ## gene test1_stat test1_p test2_stat test2_p test1_padj test2_padj ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 BRCA1 4.40 0.632 16.3 0.0421 0.843 0.0828 ## 2 BRCA2 9.67 0.932 12.7 0.0621 0.932 0.0828 ## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>
dplyr::filter(gene_stats, str_detect(gene,"^BRCA[12]$"))
The argument "^BRCA[12]$" is a regular expression that searches for the following:
BRCA (^BRCA)1 or 2 ([12])$). - match any single character* - match zero or more of the character immediately preceding the * "", "1", "11", "111", ...+ - match one or more of the character immediately preceding the * "1", "11", "111", ...[XYZ] - match one of any of the characters between [] (e.g. X, Y, or Z)^, $ - match the beginning and end of the string, respectivelydplyr::select()dplyr::select() - Subset Columns by Namedplyr::select() function picks columns out of a tibble:stats <- dplyr::select(gene_stats, test1_stat, test2_stat) stats
## # A tibble: 4 × 2 ## test1_stat test2_stat ## <dbl> <dbl> ## 1 12.5 34.2 ## 2 4.40 16.3 ## 3 9.67 12.7 ## 4 45.7 0.757
dplyr::select() - Helper functionsdplyr also has “helper functions” for more flexible selection of columns_stat, we could use the ends_with() helper function:stats <- dplyr::select(gene_stats, ends_with("_stat"))
stats
## # A tibble: 4 × 2 ## test1_stat test2_stat ## <dbl> <dbl> ## 1 12.5 34.2 ## 2 4.40 16.3 ## 3 9.67 12.7 ## 4 45.7 0.757
dplyr::select() - Renaming columnsselect() allows for the renaming of selected columns:stats <- dplyr::select(gene_stats, t=test1_stat, chisq=test2_stat ) stats
## # A tibble: 4 × 2 ## t chisq ## <dbl> <dbl> ## 1 12.5 34.2 ## 2 4.40 16.3 ## 3 9.67 12.7 ## 4 45.7 0.757
dplyr::select() - Reorder columnsdplyr::select(gene_stats,
gene,
test1_stat, test1_p, test1_padj,
test2_stat, test2_p, test2_padj,
signif_either,
signif_both
)
## # A tibble: 4 × 9 ## gene test1_stat test1_p test1_padj test2_stat test2_p test2_padj ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 HOXD1 12.5 0.103 0.206 34.2 0.000013 0.000052 ## 2 BRCA1 4.40 0.632 0.843 16.3 0.0421 0.0828 ## 3 BRCA2 9.67 0.932 0.932 12.7 0.0621 0.0828 ## 4 SNCA 45.7 0.00042 0.00168 0.757 0.915 0.915 ## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>
dplyr::arrange()dplyr::arrange() - Order rows based on their valuesdplyr::arrange(gene_stats, test1_p)
## # A tibble: 4 × 9 ## gene test1_stat test1_p test2_stat test2_p test1_padj test2_padj ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 SNCA 45.7 0.00042 0.757 0.915 0.00168 0.915 ## 2 HOXD1 12.5 0.103 34.2 0.000013 0.206 0.000052 ## 3 BRCA1 4.40 0.632 16.3 0.0421 0.843 0.0828 ## 4 BRCA2 9.67 0.932 12.7 0.0621 0.932 0.0828 ## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>
dplyr::arrange() - Changing sort directiondplyr::arrange() sort ascending by defaultdesc() helper function# desc() sorts descending dplyr::arrange(gene_stats, desc(abs(test1_stat)))
## # A tibble: 4 × 9 ## gene test1_stat test1_p test2_stat test2_p test1_padj test2_padj ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 SNCA 45.7 0.00042 0.757 0.915 0.00168 0.915 ## 2 HOXD1 12.5 0.103 34.2 0.000013 0.206 0.000052 ## 3 BRCA2 9.67 0.932 12.7 0.0621 0.932 0.0828 ## 4 BRCA1 4.40 0.632 16.3 0.0421 0.843 0.0828 ## # ℹ 2 more variables: signif_either <lgl>, signif_both <lgl>
In the previous sections, we performed the following operations:
dplyr::mutate() and p.adjust functionsdplyr::mutate()stringr::str_to_upper and dplyr::mutate()select()dplyr::filter()dplyr::arrange()gene_stats <- dplyr::mutate(gene_stats,
test1_padj=p.adjust(test1_p,method="fdr"),
test2_padj=p.adjust(test2_p,method="fdr"),
signif_either=(test1_padj < 0.05 | test2_padj < 0.05),
signif_both=(test1_padj < 0.05 & test2_padj < 0.05),
gene=stringr::str_to_upper(gene)
) %>%
dplyr::select(
gene,
test1_stat, test1_p, test1_padj,
test2_stat, test2_p, test2_padj,
signif_either,
signif_both
) %>%
dplyr::filter( test1_padj < 0.05 | test2_padj < 0.05 ) %>%
dplyr::arrange( test1_p )
Grouping data together allow us to summarize them
Consider:
metadata <- tribble(
~ID, ~condition, ~age_at_death, ~Braak_stage, ~APOE_genotype,
"A01", "AD", 78, 5, "e4/e4",
"A02", "AD", 81, 6, "e3/e4",
"A03", "AD", 90, 5, "e4/e4",
"A04", "Control", 80, 1, "e3/e4",
"A05", "Control", 79, 0, "e3/e3",
"A06", "Control", 81, 0, "e2/e3"
)metadata
## # A tibble: 6 × 5 ## ID condition age_at_death Braak_stage APOE_genotype ## <chr> <chr> <dbl> <dbl> <chr> ## 1 A01 AD 78 5 e4/e4 ## 2 A02 AD 81 6 e3/e4 ## 3 A03 AD 90 5 e4/e4 ## 4 A04 Control 80 1 e3/e4 ## 5 A05 Control 79 0 e3/e3 ## 6 A06 Control 81 0 e2/e3
dplyr::group_by()dplyr::group_by() groups rows together based on condition
dplyr::group_by(metadata, condition )
## # A tibble: 6 × 5 ## # Groups: condition [2] ## ID condition age_at_death Braak_stage APOE_genotype ## <chr> <chr> <dbl> <dbl> <chr> ## 1 A01 AD 78 5 e4/e4 ## 2 A02 AD 81 6 e3/e4 ## 3 A03 AD 90 5 e4/e4 ## 4 A04 Control 80 1 e3/e4 ## 5 A05 Control 79 0 e3/e3 ## 6 A06 Control 81 0 e2/e3
dplyr::summarize()dplyr::summarize() (or summarise()) to compute the mean age at death for each group:
dplyr::group_by(metadata, condition ) %>% dplyr::summarize(mean_age_at_death = mean(age_at_death))
## # A tibble: 2 × 2 ## condition mean_age_at_death ## <chr> <dbl> ## 1 AD 83 ## 2 Control 80
dplyr::group_by(metadata, condition ) %>% dplyr::summarize( mean_age_at_death = mean(age_at_death), sd_age_at_death = sd(age_at_death), lower_age = mean_age_at_death-sd_age_at_death, upper_age = mean_age_at_death+sd_age_at_death, )
## # A tibble: 2 × 5 ## condition mean_age_at_death sd_age_at_death lower_age upper_age ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 AD 83 6.24 76.8 89.2 ## 2 Control 80 1 79 81
summarize()dplyr::summarize() has some helper functionsn() provides the number of rows each group hasdplyr::group_by(metadata, condition ) %>% dplyr::summarize( num_subjects = n(), mean_age_at_death = mean(age_at_death), )
## # A tibble: 2 × 3 ## condition num_subjects mean_age_at_death ## <chr> <int> <dbl> ## 1 AD 3 83 ## 2 Control 3 80
Sometimes the shape and format of our data doesn’t let us summarize easily
Consider our previous example:
gene_stats <- tribble(
~gene, ~test1_stat, ~test1_p, ~test2_stat, ~test2_p,
"APOE", 12.509293, 0.1032, 34.239521, 1.3e-5,
"HOXD1", 4.399211, 0.6323, 16.332318, 0.0421,
"SNCA", 45.748431, 4.2e-9, 0.757188, 0.9146,
)What are the mean and range of our statistic columns?
Could do this manually like so:
tribble(
~test_name, ~min, ~mean, ~max,
"test1_stat", min(gene_stats$test1_stat),
mean(gene_stats$test1_stat), max(gene_stats$test1_stat),
"test2_stat", min(gene_stats$test2_stat),
mean(gene_stats$test2_stat), max(gene_stats$test2_stat),
)
## # A tibble: 2 × 4 ## test_name min mean max ## <chr> <dbl> <dbl> <dbl> ## 1 test1_stat 4.40 20.9 45.7 ## 2 test2_stat 0.757 17.1 34.2
group_by() and summarize() summarize rows!tidyr::pivot_longer()tidyr::pivot_longer()tidyr::pivot_longer( gene_stats, c(test1_stat, test2_stat), # columns to pivot names_to="test", values_to="stat" )
## # A tibble: 6 × 5 ## gene test1_p test2_p test stat ## <chr> <dbl> <dbl> <chr> <dbl> ## 1 APOE 0.103 0.000013 test1_stat 12.5 ## 2 APOE 0.103 0.000013 test2_stat 34.2 ## 3 HOXD1 0.632 0.0421 test1_stat 4.40 ## 4 HOXD1 0.632 0.0421 test2_stat 16.3 ## 5 SNCA 0.0000000042 0.915 test1_stat 45.7 ## 6 SNCA 0.0000000042 0.915 test2_stat 0.757
tidyr::pivot_longer()long_gene_stats <- tidyr::pivot_longer(
gene_stats,
ends_with("_stat"), # was c(test1_stat, test2_stat),
names_to="test",
values_to="stat"
)
## # A tibble: 6 × 5 ## gene test1_p test2_p test stat ## <chr> <dbl> <dbl> <chr> <dbl> ## 1 APOE 0.103 0.000013 test1_stat 12.5 ## 2 APOE 0.103 0.000013 test2_stat 34.2 ## 3 HOXD1 0.632 0.0421 test1_stat 4.40 ## 4 HOXD1 0.632 0.0421 test2_stat 16.3 ## 5 SNCA 0.0000000042 0.915 test1_stat 45.7 ## 6 SNCA 0.0000000042 0.915 test2_stat 0.757
group_by() on the test column and summarize() on the stat column:long_gene_stats %>%
dplyr::group_by(test) %>%
dplyr::summarize(
min = min(stat), mean = mean(stat), max = max(stat)
)
## # A tibble: 2 × 4 ## test min mean max ## <chr> <dbl> <dbl> <dbl> ## 1 test1_stat 4.40 20.9 45.7 ## 2 test2_stat 0.757 17.1 34.2
tidyr::pivot_wider()pivot_longer() is pivot_wider()pivot_longer(), reverses process to create tibble with those variables as columnspivot_longer() followed by pivot_wider()intensities <- readr::read_csv("example_intensity_data.csv")
intensities
# A tibble: 54,675 x 36
probe GSM972389 GSM972390 GSM972396 GSM972401 GSM972409
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1007_s_at 9.54 10.2 9.72 9.68 9.35
2 1053_at 7.62 7.92 7.17 7.24 8.20
3 117_at 5.50 5.56 5.06 7.44 5.19
4 121_at 7.27 7.96 7.42 7.34 7.49
5 1255_g_at 2.79 3.10 2.78 2.91 3.02
# ... with 54,665 more rows, and 26 more variables: GSM972433 <dbl>
# GSM972487 <dbl>, GSM972488 <dbl>, GSM972489 <dbl>,
# GSM972510 <dbl>, GSM972512 <dbl>, GSM972521 <dbl>
“tidy” data has the following properties:
Data from high throughput biological experiments have many more variables than observations!
Biological data matrices are usually transposed
# A tibble: 54,675 x 36 probe GSM972389 GSM972390 GSM972396 GSM972401 GSM972409 <chr> <dbl> <dbl> <dbl> <dbl> <dbl> 1 1007_s_at 9.54 10.2 9.72 9.68 9.35 2 1053_at 7.62 7.92 7.17 7.24 8.20
Base R and tidyverse are optimized to perform computations on columns not rows
Can perform operations on rows rather than columns, but code may perform poorly
A couple options:
Pivot into long format.
Compute row-wise statistics using apply().
intensity_variance <- apply(intensities, 2, var) intensities$variance <- intensity_variance
Bioconductor is itself a package called BiocManager
BiocManager must be installed prior to installing other Bioconductor packages
To [install bioconductor] (note install.packages()):
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(version = "3.16")# installs the affy bioconductor package for microarray analysis
BiocManager::install("affy")
In addition, Biconductor provides three types of documentation:
SummarizedExperiment class stores data and metadata for an experimentSummarizedExperiment Illustration
SummarizedExperiment DetailsSummarizedExperiment class is used ubiquitously throughout the Bioconductor package ecosystemSummarizedExperiment stores:
assays)colData and exptData)rowData)