Advanced functionality in R is provided through packages written and supported by R community members
With the exception of bioconductor packages, all R packages are hosted on the Comprehensive R Archive Network (CRAN) web site
There are more than 18,000 packages hosted on CRAN
To install a package from CRAN, use the install.packages
function in the R console:
# install one package install.packages("tidyverse") # install multiple packages install.packages(c("readr","dplyr"))
ENSGNNNNNNNNNNN
, where N
are numbers
ENSG00000130203
ENSG00000130203.10
where .10
indicates this is the 10th updated version of this geneGene ID Prefix | Organism | Symbol | Example |
---|---|---|---|
ENSG |
Homo sapiens | HOXA1 | ENSG00000105991 |
ENSMUSG |
Mus musculus (mouse) | Hoxa1 | ENSMUSG00000029844 |
ENSDARG |
Danio rario (zebrafish) | hoxa1a | ENSDARG00000104307 |
ENSFCAG |
Felis catus (cat) | HOXA1 | ENSFCAG00000007937 |
ENSMMUG |
Macaca mulata (macaque) | HOXA1 | ENSMMUG00000012807 |
384
, in mouse 11816
OMIM:NNNNN
, where each OMIM ID refers to a unique gene or human phenotype
OMIM:107741
P02649
biomaRt
biomaRt
Bioconductor packagebiomaRt
Example# this assumes biomaRt is already installed through bioconductor library(biomaRt) # the human biomaRt database is named "hsapiens_gene_ensembl" ensembl <- useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl")
biomaRt
Example# listAttributes() returns a data frame, convert to tibble as_tibble(listAttributes(ensembl)) # A tibble: 3,143 x 3 name description page <chr> <chr> <chr> 1 ensembl_gene_id Gene stable ID feature_page 2 ensembl_gene_id_version Gene stable ID version feature_page 3 ensembl_transcript_id Transcript stable ID feature_page 4 ensembl_transcript_id_version Transcript stable ID version feature_page 5 ensembl_peptide_id Protein stable ID feature_page 6 ensembl_peptide_id_version Protein stable ID version feature_page 7 ensembl_exon_id Exon stable ID feature_page 8 description Gene description feature_page 9 chromosome_name Chromosome/scaffold name feature_page 10 start_position Gene start (bp) feature_page # ... with 3,133 more rows
biomaRt
Examplename
column provides programmatic name associated with the attribute that can be used to retrieve that annotation information using the getBM()
function:# create a tibble with ensembl gene ID, HGNC gene symbol, and gene description gene_map <- as_tibble( getBM( attributes=c("ensembl_gene_id", "hgnc_symbol", "description"), mart=ensembl ) )
biomaRt
Examplegene_map # A tibble: 68,012 x 3 ensembl_gene_id hgnc_symbol description <chr> <chr> <chr> 1 ENSG00000210049 MT-TF mitochondrially encoded tRNA-Phe ... 2 ENSG00000211459 MT-RNR1 mitochondrially encoded 12S rRNA ... 3 ENSG00000210077 MT-TV mitochondrially encoded tRNA-Val ... 4 ENSG00000210082 MT-RNR2 mitochondrially encoded 16S rRNA ... 5 ENSG00000209082 MT-TL1 mitochondrially encoded tRNA-Leu ... 6 ENSG00000198888 MT-ND1 mitochondrially encoded NADH:ubiquinone 7 ENSG00000210100 MT-TI mitochondrially encoded tRNA-Ile ... 8 ENSG00000210107 MT-TQ mitochondrially encoded tRNA-Gln ... 9 ENSG00000210112 MT-TM mitochondrially encoded tRNA-Met ... 10 ENSG00000198763 MT-ND2 mitochondrially encoded NADH:ubiquinone # ... with 68,002 more rows
tribble( ~gene, ~test1_stat, ~test1_p, ~test2_stat, ~test2_p, "APOE", 12.509293, 0.1032, 34.239521, 1.3e-5, "HOXD1", 4.399211, 0.6323, 16.332318, 0.0421, "SNCA", 45.748431, 4.2e-9, 0.757188, 0.9146, )
## # A tibble: 3 × 5 ## gene test1_stat test1_p test2_stat test2_p ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 APOE 12.5 0.103 34.2 0.000013 ## 2 HOXD1 4.40 0.632 16.3 0.0421 ## 3 SNCA 45.7 0.0000000042 0.757 0.915
gene
column in tibble is a gene symbol
Other gene IDs exist for the same gene, e.g. Ensembl Gene IDs, that are associated with other information
Need to match up gene symbol with corresponding Ensembl Gene ID using a known mapping:
gene_map <- tribble( ~symbol, ~ENSGID, ~gene_name, "APOE", "ENSG00000130203", "apolipoprotein E", "BRCA1", "ENSG00000012048", "BRCA1 DNA repair associated", "HOXD1", "ENSG00000128645", "homeobox D1", "SNCA", "ENSG00000145335", "synuclein alpha", )
## # A tibble: 4 × 3 ## symbol ENSGID gene_name ## <chr> <chr> <chr> ## 1 APOE ENSG00000130203 apolipoprotein E ## 2 BRCA1 ENSG00000012048 BRCA1 DNA repair associated ## 3 HOXD1 ENSG00000128645 homeobox D1 ## 4 SNCA ENSG00000145335 synuclein alpha
Term for matching up rows of two tibbles (tables) called joining
We join the gene_stats
tibble with the gene_map
tibble
Match the gene
column with the symbol
column
dplyr::left_join()
function accepts two data frames and the names of columns in each that share common values:
dplyr::left_join( x=gene_stats, y=gene_map, by=c("gene" = "symbol") )
## # A tibble: 3 × 7 ## gene test1_stat test1_p test2_stat test2_p ENSGID gene_name ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> ## 1 APOE 12.5 0.103 34.2 0.000013 ENSG00000130203 apolipoprot… ## 2 HOXD1 4.40 0.632 16.3 0.0421 ENSG00000128645 homeobox D1 ## 3 SNCA 45.7 0.0000000042 0.757 0.915 ENSG00000145335 synuclein a…
Same thing with pipe:
gene_stats %>% dplyr::left_join( gene_map, by=c("gene" = "symbol") )
## # A tibble: 3 × 7 ## gene test1_stat test1_p test2_stat test2_p ENSGID gene_name ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> ## 1 APOE 12.5 0.103 34.2 0.000013 ENSG00000130203 apolipoprot… ## 2 HOXD1 4.40 0.632 16.3 0.0421 ENSG00000128645 homeobox D1 ## 3 SNCA 45.7 0.0000000042 0.757 0.915 ENSG00000145335 synuclein a…
The “direction” of the join determines which values appear in output
Left join means include all rows in “left” tibble even if there is no match
Note BRCA1
is in gene_map
but not gene_stats
:
dplyr::left_join( gene_map, # gene_map is "left" gene_stats, # gene_stats is "right" by=c("symbol" = "gene") )
## # A tibble: 4 × 7 ## symbol ENSGID gene_name test1_stat test1_p test2_stat test2_p ## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 APOE ENSG00000130203 apolipoprotein… 12.5 1.03e-1 34.2 1.3 e-5 ## 2 BRCA1 ENSG00000012048 BRCA1 DNA repa… NA NA NA NA ## 3 HOXD1 ENSG00000128645 homeobox D1 4.40 6.32e-1 16.3 4.21e-2 ## 4 SNCA ENSG00000145335 synuclein alpha 45.7 4.20e-9 0.757 9.15e-1
“Right joins” are simply the opposite of left joins:
dplyr::right_join( gene_stats, # gene_stats is "left" gene_map, # gene_map is "right" by=c("gene" = "symbol") )
## # A tibble: 4 × 7 ## gene test1_stat test1_p test2_stat test2_p ENSGID gene_name ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> ## 1 APOE 12.5 0.103 34.2 0.000013 ENSG00000130203 apolipopr… ## 2 HOXD1 4.40 0.632 16.3 0.0421 ENSG00000128645 homeobox … ## 3 SNCA 45.7 0.0000000042 0.757 0.915 ENSG00000145335 synuclein… ## 4 BRCA1 NA NA NA NA ENSG00000012048 BRCA1 DNA…
Inner joins return results only for rows that have a match between the two tibbles
Recall gene_map
included BRCA1 even though it was not found in gene_stats
Inner join will not include this row, because no match in gene_stats
was found:
dplyr::inner_join( gene_map, gene_stats, by=c("symbol" = "gene") )
## # A tibble: 3 × 7 ## symbol ENSGID gene_name test1_stat test1_p test2_stat test2_p ## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 APOE ENSG00000130203 apolipoprotein E 12.5 1.03e-1 34.2 1.3 e-5 ## 2 HOXD1 ENSG00000128645 homeobox D1 4.40 6.32e-1 16.3 4.21e-2 ## 3 SNCA ENSG00000145335 synuclein alpha 45.7 4.20e-9 0.757 9.15e-1
dplyr::full_join()
(also sometimes called an outer join)orth_map # A tibble: 26,390 x 4 Gene.stable.ID HGNC.symbol Gene.stable.ID.1 MGI.symbol <chr> <chr> <chr> <chr> 1 ENSG00000198695 MT-ND6 ENSMUSG00000064368 "mt-Nd6" 2 ENSG00000212907 MT-ND4L ENSMUSG00000065947 "mt-Nd4l" 3 ENSG00000279169 PRAMEF13 ENSMUSG00000094303 "" 4 ENSG00000279169 PRAMEF13 ENSMUSG00000094722 "" 5 ENSG00000279169 PRAMEF13 ENSMUSG00000095666 "" 6 ENSG00000279169 PRAMEF13 ENSMUSG00000094741 "" 7 ENSG00000279169 PRAMEF13 ENSMUSG00000094836 "" 8 ENSG00000279169 PRAMEF13 ENSMUSG00000074720 "" 9 ENSG00000279169 PRAMEF13 ENSMUSG00000096236 "" 10 ENSG00000198763 MT-ND2 ENSMUSG00000064345 "mt-Nd2" # ... with 26,380 more rows
AnnotationDbi
An effective data visualization:
A great visualization has some additional properties:
ggplot2
ggplot2
Fundamentalsggplot2
ExampleA simple example dataset:
## # A tibble: 20 × 8 ## ID age_at_death condition tau abeta iba1 gfap braak_stage ## <chr> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <fct> ## 1 A1 73 AD 96548 176324 157501 78139 4 ## 2 A2 82 AD 95251 0 147637 79348 4 ## ... ## 10 A10 69 AD 48357 27260 47024 78507 2 ## 11 C1 80 Control 62684 93739 131595 124075 3 ## 12 C2 77 Control 63598 69838 7189 35597 3 ## ... ## 20 C10 73 Control 15781 16592 10271 100858 1
ggplot2
Exampleage_at_death
tau
ggplot( data=ad_metadata, mapping = aes( x = age_at_death, y = tau ) ) + geom_point()
ggplot(data=ad_metadata, mapping=aes(x=age_at_death, y=tau)) + geom_point()
ggplot2
Plot Componentsggplot(data=ad_metadata, mapping=aes(x=age_at_death, y=tau)) + geom_point()
ggplot()
- function creates a plotdata=
- pass a tibble with the datamapping=aes(...)
- Define an aesthetics mapping connecting data to plot propertiesgeom_point(...)
- Specify geometry as points where marks will be made at pairs of x,y coordinatesIs this the whole story?
There are both AD and Control subjects in this dataset!
How does condition
relate to this relationship we see?
Layer on an additional aesthetic of color:
ggplot( data=ad_metadata, mapping = aes( x = age_at_death, y = tau, color=condition # color each point ) ) + geom_point()
ggplot(data=ad_metadata, mapping=aes( x=age_at_death, y=tau, color=condition )) + geom_point()
Differences in distributions of variables can be important
Examine the distribution of age_at_death
for AD and Control samples with violin geometry with geom_violin()
:
ggplot(data=ad_metadata, mapping = aes(x=condition, y=age_at_death)) + geom_violin()
ggplot(data=ad_metadata, mapping = aes(x=condition, y=age_at_death)) + geom_violin()
patchwork
library:library(patchwork) age_boxplot <- ggplot( data=ad_metadata, mapping = aes(x=condition, y=age_at_death) ) + geom_boxplot() tau_boxplot <- ggplot( data=ad_metadata, mapping=aes(x=condition, y=tau) ) + geom_boxplot() age_boxplot | tau_boxplot # this puts the plots side by side
age_boxplot | tau_boxplot # this puts the plots side by side