styler
packagescale()
As mentioned in our section on types of biological data, we often need to combine different sources of data together to aid in interpretation of our results. Below we redefine the tibble of gene statistics from above to have properly capitalized gene symbols:
<- tribble(
gene_stats ~gene, ~test1_stat, ~test1_p, ~test2_stat, ~test2_p,
"APOE", 12.509293, 0.1032, 34.239521, 1.3e-5,
"HOXD1", 4.399211, 0.6323, 16.332318, 0.0421,
"SNCA", 45.748431, 4.2e-9, 0.757188, 0.9146,
) gene_stats
## # A tibble: 3 x 5
## gene test1_stat test1_p test2_stat test2_p
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 APOE 12.5 0.103 34.2 0.000013
## 2 HOXD1 4.40 0.632 16.3 0.0421
## 3 SNCA 45.7 0.0000000042 0.757 0.915
Our gene identifiers in this data frame are gene
symbols which,
while convenient for our human brains to remember, can change over time and have
many aliases (e.g. the APOE gene has also been called AD2, LDLCQ5, APO-E, and
ApoE4 as listed on its
genecard). This can
make writing code that refers to a specific gene difficult, since there are so
many possible names to look for. Fortunately, there are alternative gene
identifier systems that do a better job of maintaining stable, consistent gene
identifiers, one of the most popular being Ensembl.
Ensembl gene IDs always take the form ENSGNNNNNNNNNNN
where N
s are digits.
These IDs are much more stable and predictable than gene symbols, and are
preferable when working with genes in code.
We wish to add Ensembl IDs for the genes in our gene_stats
result to the
tibble as a new column. Now suppose we have obtained another file with cross
referenced gene identifiers like the following:
<- tribble(
gene_map ~symbol, ~ENSGID, ~gene_name,
"APOE", "ENSG00000130203", "apolipoprotein E",
"BRCA1", "ENSG00000012048", "BRCA1 DNA repair associated",
"HOXD1", "ENSG00000128645", "homeobox D1",
"SNCA", "ENSG00000145335", "synuclein alpha",
) gene_map
## # A tibble: 4 x 3
## symbol ENSGID gene_name
## <chr> <chr> <chr>
## 1 APOE ENSG00000130203 apolipoprotein E
## 2 BRCA1 ENSG00000012048 BRCA1 DNA repair associated
## 3 HOXD1 ENSG00000128645 homeobox D1
## 4 SNCA ENSG00000145335 synuclein alpha
Imagine that this file contains mappings for all ~60,000 genes in the human
genome. While it might be simple to look up our three genes in this file and
annotate manually, it is easier to ask dplyr
to do it for us. We can do this
using the dplyr::left_join()
function which accepts two data frames and the
names of columns in each that share common values:
::left_join(
dplyrx=gene_stats,
y=gene_map,
by=c("gene" = "symbol")
)
## # A tibble: 3 x 7
## gene test1_stat test1_p test2_stat test2_p ENSGID gene_name
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 APOE 12.5 0.103 34.2 0.000013 ENSG00000130203 apolipoprot~
## 2 HOXD1 4.40 0.632 16.3 0.0421 ENSG00000128645 homeobox D1
## 3 SNCA 45.7 0.0000000042 0.757 0.915 ENSG00000145335 synuclein a~
Notice that the additional columns in gene_map
that were not involved in the
join (i.e. ENSG
and gene_name
) are appended to the gene_stats
tibble. If
we wanted to use pipes, we could implement the same join above as follows:
%>% dplyr::left_join(
gene_stats
gene_map,by=c("gene" = "symbol")
)
## # A tibble: 3 x 7
## gene test1_stat test1_p test2_stat test2_p ENSGID gene_name
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 APOE 12.5 0.103 34.2 0.000013 ENSG00000130203 apolipoprot~
## 2 HOXD1 4.40 0.632 16.3 0.0421 ENSG00000128645 homeobox D1
## 3 SNCA 45.7 0.0000000042 0.757 0.915 ENSG00000145335 synuclein a~
Under the hood, the dplyr::left_join()
function took the values in
gene_stats$gene
and looked for the corresponding row in gene_map
with the
same value in in the symbol
column. It then appends all the additional columns
of gene_map
for the matching rows and returns the result. We have added the
identifier mapping we desired with only a few lines of code. And what’s more,
this code will work no matter how many genes we had in gene_stats
as long as
gene_map
contains a mapping value for the values in gene_stats$gene
.
But what happens if we have genes in gene_stats
that don’t exist in our
mapping? In the above example, we use a left join because we want to include
all the rows in gene_stats
regardless of whether a mapping exists in
gene_map
. In this case this was fine because all of our genes in gene_stats
had a corresponding row in the mapping. However, notice that in the mapping
there is an additional gene, BRCA1 that
is not in our gene statistics tibble. If we reverse the order of the join,
observe what happens:
%>% dplyr::left_join(
gene_map
gene_stats,by=c("symbol" = "gene")
)
## # A tibble: 4 x 7
## symbol ENSGID gene_name test1_stat test1_p test2_stat test2_p
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 APOE ENSG00000130203 apolipoprotein~ 12.5 1.03e-1 34.2 1.3 e-5
## 2 BRCA1 ENSG00000012048 BRCA1 DNA repa~ NA NA NA NA
## 3 HOXD1 ENSG00000128645 homeobox D1 4.40 6.32e-1 16.3 4.21e-2
## 4 SNCA ENSG00000145335 synuclein alpha 45.7 4.2 e-9 0.757 9.15e-1
Now the order of the rows is the same as in gene_map
, and the columns for our
missing gene BRCA1 are filled with NA
. This is the left join at work, where
the record from gene_map
is included regardless of whether a corresponding
value was found in gene_stats
.
There are additional types of joins besides left joins. Right joins are simply the opposite of left joins:
%>% dplyr::right_join(
gene_stats
gene_map,by=c("gene" = "symbol")
)
## # A tibble: 4 x 7
## gene test1_stat test1_p test2_stat test2_p ENSGID gene_name
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 APOE 12.5 0.103 34.2 0.000013 ENSG00000130203 apolipopr~
## 2 HOXD1 4.40 0.632 16.3 0.0421 ENSG00000128645 homeobox ~
## 3 SNCA 45.7 0.0000000042 0.757 0.915 ENSG00000145335 synuclein~
## 4 BRCA1 NA NA NA NA ENSG00000012048 BRCA1 DNA~
The result is the same as the left join on gene_map
except the order of the
resulting columns is different.
Inner joins return results for rows that have a match between the two tibbles.
Recall our left join on gene_map
included BRCA1 even though it was not found
in gene_stats
. An inner join will not include this row, because no match in
gene_stats
was found:
%>% dplyr::inner_join(
gene_map
gene_stats,by=c("symbol" = "gene")
)
## # A tibble: 3 x 7
## symbol ENSGID gene_name test1_stat test1_p test2_stat test2_p
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 APOE ENSG00000130203 apolipoprotein E 12.5 1.03e-1 34.2 1.3 e-5
## 2 HOXD1 ENSG00000128645 homeobox D1 4.40 6.32e-1 16.3 4.21e-2
## 3 SNCA ENSG00000145335 synuclein alpha 45.7 4.2 e-9 0.757 9.15e-1
One last type of join is the
dplyr::full_join()
(also sometimes called an outer join). As you may expect, a full join will
return all rows from both tibbles whether a match in the other table was found
or not.
In the example above, there was a one-to-one relationship between the gene symbols in both tibbles. This made the joined tibble tidy. However, when a one-to-many relationship exists, i.e. one gene symbol in one tibble has multiple rows in the other, this can lead to what appears to be duplicate rows in the joined result. Due to the relative instability of gene symbols, it is very common to have multiple Ensembl genes associated with a single gene symbol. The following takes a gene mapping of Ensembl IDs to gene symbols and identifies cases where multiple Ensembl IDs map to a single gene symbol:
::read_tsv("mart_export.tsv") %>%
readr::filter(
dplyr`HGNC symbol` != "NA" & # many unstudied genes have Ensembl IDs but no official symbol
`HGNC symbol` %in% `HGNC symbol`[duplicated(`HGNC symbol`)]) %>%
::arrange(`HGNC symbol`) dplyr
## # A tibble: 7,268 x 3
## `Gene stable ID` `HGNC symbol` `Gene name`
## <chr> <chr> <chr>
## 1 ENSG00000261846 AADACL2 arylacetamide deacetylase like 2
## 2 ENSG00000197953 AADACL2 arylacetamide deacetylase like 2
## 3 ENSG00000262466 AADACL2-AS1 <NA>
## 4 ENSG00000242908 AADACL2-AS1 <NA>
## 5 ENSG00000276072 AATF apoptosis antagonizing transcription factor
## 6 ENSG00000275700 AATF apoptosis antagonizing transcription factor
## 7 ENSG00000281173 ABCB10P1 <NA>
## 8 ENSG00000280461 ABCB10P1 <NA>
## 9 ENSG00000274099 ABCB10P1 <NA>
## 10 ENSG00000282479 ABCB10P3 <NA>
## # ... with 7,258 more rows
There are over 7,000 Ensembl IDs that map to the same gene symbol as another
Ensembl ID. That is more than 10% of all Ensembl IDs. So now let’s create a new
gene_stats
tibble with one of these gene symbols and join with the map to
see what happens:
<- readr::read_tsv("mart_export.tsv")
gene_map <- tribble(
gene_stats ~gene, ~test1_stat, ~test1_p, ~test2_stat, ~test2_p,
"APOE", 12.509293, 0.1032, 34.239521, 1.3e-5,
"HOXD1", 4.399211, 0.6323, 16.332318, 0.0421,
"SNCA", 45.748431, 4.2e-9, 0.757188, 0.9146,
"DEAF1", 0.000000, 1.0, 0, 1.0
%>% left_join(gene_map, by=c("gene" = "HGNC symbol"))
) gene_stats
## # A tibble: 5 x 7
## gene test1_stat test1_p test2_stat test2_p `Gene stable ID` `Gene name`
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 APOE 12.5 0.103 34.2 0.000013 ENSG00000130203 apolipopro~
## 2 HOXD1 4.40 0.632 16.3 0.0421 ENSG00000128645 homeobox D1
## 3 SNCA 45.7 0.0000000042 0.757 0.915 ENSG00000145335 synuclein ~
## 4 DEAF1 0 1 0 1 ENSG00000282712 DEAF1 tran~
## 5 DEAF1 0 1 0 1 ENSG00000177030 DEAF1 tran~
Notice how there are two rows for the DEAF1 gene that have identical values
except for the Stable Gene ID
column. This is a very common problem when
mapping gene symbols to other gene identifiers and there is no general solution
to picking the “best” mapping, short of manually inspecting all of the
duplicates and choosing which one is the most appropriate yourself (which
obviously is a huge amount of work). However, we do desire to remove the
duplicated rows. In this case, since all the values besides the Ensembl ID are
the same, effectively it doesn’t matter which duplicate rows we eliminate. We
can do this using the duplicated()
function, which returns TRUE
for all but
the first row of a set of duplicated values:
filter(gene_stats, !duplicated(gene))
## # A tibble: 4 x 7
## gene test1_stat test1_p test2_stat test2_p `Gene stable ID` `Gene name`
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 APOE 12.5 0.103 34.2 0.000013 ENSG00000130203 apolipopro~
## 2 HOXD1 4.40 0.632 16.3 0.0421 ENSG00000128645 homeobox D1
## 3 SNCA 45.7 0.0000000042 0.757 0.915 ENSG00000145335 synuclein ~
## 4 DEAF1 0 1 0 1 ENSG00000282712 DEAF1 tran~
However, in general, you must be careful about identifying these types of one-to-many mapping issues and also about how you mitigate them.