5.10 Relational Data

As mentioned in our section on types of biological data, we often need to combine different sources of data together to aid in interpretation of our results. Below we redefine the tibble of gene statistics from above to have properly capitalized gene symbols:

gene_stats <- tribble(
    ~gene, ~test1_stat, ~test1_p, ~test2_stat, ~test2_p,
   "APOE",   12.509293,   0.1032,   34.239521,   1.3e-5,
  "HOXD1",    4.399211,   0.6323,   16.332318,   0.0421,
   "SNCA",   45.748431,   4.2e-9,    0.757188,   0.9146,
)
gene_stats
## # A tibble: 3 x 5
##   gene  test1_stat      test1_p test2_stat  test2_p
##   <chr>      <dbl>        <dbl>      <dbl>    <dbl>
## 1 APOE       12.5  0.103            34.2   0.000013
## 2 HOXD1       4.40 0.632            16.3   0.0421  
## 3 SNCA       45.7  0.0000000042      0.757 0.915

Our gene identifiers in this data frame are gene symbols which, while convenient for our human brains to remember, can change over time and have many aliases (e.g. the APOE gene has also been called AD2, LDLCQ5, APO-E, and ApoE4 as listed on its genecard). This can make writing code that refers to a specific gene difficult, since there are so many possible names to look for. Fortunately, there are alternative gene identifier systems that do a better job of maintaining stable, consistent gene identifiers, one of the most popular being Ensembl. Ensembl gene IDs always take the form ENSGNNNNNNNNNNN where Ns are digits. These IDs are much more stable and predictable than gene symbols, and are preferable when working with genes in code.

We wish to add Ensembl IDs for the genes in our gene_stats result to the tibble as a new column. Now suppose we have obtained another file with cross referenced gene identifiers like the following:

gene_map <- tribble(
    ~symbol,            ~ENSGID,                    ~gene_name,
     "APOE",  "ENSG00000130203",            "apolipoprotein E",
     "BRCA1",  "ENSG00000012048", "BRCA1 DNA repair associated",
    "HOXD1",  "ENSG00000128645",                 "homeobox D1",
     "SNCA",  "ENSG00000145335",             "synuclein alpha",
)
gene_map
## # A tibble: 4 x 3
##   symbol ENSGID          gene_name                  
##   <chr>  <chr>           <chr>                      
## 1 APOE   ENSG00000130203 apolipoprotein E           
## 2 BRCA1  ENSG00000012048 BRCA1 DNA repair associated
## 3 HOXD1  ENSG00000128645 homeobox D1                
## 4 SNCA   ENSG00000145335 synuclein alpha

Imagine that this file contains mappings for all ~60,000 genes in the human genome. While it might be simple to look up our three genes in this file and annotate manually, it is easier to ask dplyr to do it for us. We can do this using the dplyr::left_join() function which accepts two data frames and the names of columns in each that share common values:

dplyr::left_join(
    x=gene_stats,
    y=gene_map,
    by=c("gene" = "symbol")
)
## # A tibble: 3 x 7
##   gene  test1_stat      test1_p test2_stat  test2_p ENSGID          gene_name   
##   <chr>      <dbl>        <dbl>      <dbl>    <dbl> <chr>           <chr>       
## 1 APOE       12.5  0.103            34.2   0.000013 ENSG00000130203 apolipoprot~
## 2 HOXD1       4.40 0.632            16.3   0.0421   ENSG00000128645 homeobox D1 
## 3 SNCA       45.7  0.0000000042      0.757 0.915    ENSG00000145335 synuclein a~

Notice that the additional columns in gene_map that were not involved in the join (i.e. ENSG and gene_name) are appended to the gene_stats tibble. If we wanted to use pipes, we could implement the same join above as follows:

gene_stats %>% dplyr::left_join(
  gene_map,
  by=c("gene" = "symbol")
)
## # A tibble: 3 x 7
##   gene  test1_stat      test1_p test2_stat  test2_p ENSGID          gene_name   
##   <chr>      <dbl>        <dbl>      <dbl>    <dbl> <chr>           <chr>       
## 1 APOE       12.5  0.103            34.2   0.000013 ENSG00000130203 apolipoprot~
## 2 HOXD1       4.40 0.632            16.3   0.0421   ENSG00000128645 homeobox D1 
## 3 SNCA       45.7  0.0000000042      0.757 0.915    ENSG00000145335 synuclein a~

Under the hood, the dplyr::left_join() function took the values in gene_stats$gene and looked for the corresponding row in gene_map with the same value in in the symbol column. It then appends all the additional columns of gene_map for the matching rows and returns the result. We have added the identifier mapping we desired with only a few lines of code. And what’s more, this code will work no matter how many genes we had in gene_stats as long as gene_map contains a mapping value for the values in gene_stats$gene.

But what happens if we have genes in gene_stats that don’t exist in our mapping? In the above example, we use a left join because we want to include all the rows in gene_stats regardless of whether a mapping exists in gene_map. In this case this was fine because all of our genes in gene_stats had a corresponding row in the mapping. However, notice that in the mapping there is an additional gene, BRCA1 that is not in our gene statistics tibble. If we reverse the order of the join, observe what happens:

gene_map %>% dplyr::left_join(
  gene_stats,
  by=c("symbol" = "gene")
)
## # A tibble: 4 x 7
##   symbol ENSGID          gene_name       test1_stat  test1_p test2_stat  test2_p
##   <chr>  <chr>           <chr>                <dbl>    <dbl>      <dbl>    <dbl>
## 1 APOE   ENSG00000130203 apolipoprotein~      12.5   1.03e-1     34.2    1.3 e-5
## 2 BRCA1  ENSG00000012048 BRCA1 DNA repa~      NA    NA           NA     NA      
## 3 HOXD1  ENSG00000128645 homeobox D1           4.40  6.32e-1     16.3    4.21e-2
## 4 SNCA   ENSG00000145335 synuclein alpha      45.7   4.2 e-9      0.757  9.15e-1

Now the order of the rows is the same as in gene_map, and the columns for our missing gene BRCA1 are filled with NA. This is the left join at work, where the record from gene_map is included regardless of whether a corresponding value was found in gene_stats.

There are additional types of joins besides left joins. Right joins are simply the opposite of left joins:

gene_stats %>% dplyr::right_join(
  gene_map,
  by=c("gene" = "symbol")
)
## # A tibble: 4 x 7
##   gene  test1_stat       test1_p test2_stat   test2_p ENSGID          gene_name 
##   <chr>      <dbl>         <dbl>      <dbl>     <dbl> <chr>           <chr>     
## 1 APOE       12.5   0.103            34.2    0.000013 ENSG00000130203 apolipopr~
## 2 HOXD1       4.40  0.632            16.3    0.0421   ENSG00000128645 homeobox ~
## 3 SNCA       45.7   0.0000000042      0.757  0.915    ENSG00000145335 synuclein~
## 4 BRCA1      NA    NA                NA     NA        ENSG00000012048 BRCA1 DNA~

The result is the same as the left join on gene_map except the order of the resulting columns is different.

Inner joins return results for rows that have a match between the two tibbles. Recall our left join on gene_map included BRCA1 even though it was not found in gene_stats. An inner join will not include this row, because no match in gene_stats was found:

gene_map %>% dplyr::inner_join(
  gene_stats,
  by=c("symbol" = "gene")
)
## # A tibble: 3 x 7
##   symbol ENSGID          gene_name        test1_stat  test1_p test2_stat test2_p
##   <chr>  <chr>           <chr>                 <dbl>    <dbl>      <dbl>   <dbl>
## 1 APOE   ENSG00000130203 apolipoprotein E      12.5   1.03e-1     34.2   1.3 e-5
## 2 HOXD1  ENSG00000128645 homeobox D1            4.40  6.32e-1     16.3   4.21e-2
## 3 SNCA   ENSG00000145335 synuclein alpha       45.7   4.2 e-9      0.757 9.15e-1

One last type of join is the dplyr::full_join() (also sometimes called an outer join). As you may expect, a full join will return all rows from both tibbles whether a match in the other table was found or not.

5.10.1 Dealing with multiple matches

In the example above, there was a one-to-one relationship between the gene symbols in both tibbles. This made the joined tibble tidy. However, when a one-to-many relationship exists, i.e. one gene symbol in one tibble has multiple rows in the other, this can lead to what appears to be duplicate rows in the joined result. Due to the relative instability of gene symbols, it is very common to have multiple Ensembl genes associated with a single gene symbol. The following takes a gene mapping of Ensembl IDs to gene symbols and identifies cases where multiple Ensembl IDs map to a single gene symbol:

readr::read_tsv("mart_export.tsv") %>%
  dplyr::filter(
    `HGNC symbol` != "NA" & # many unstudied genes have Ensembl IDs but no official symbol
    `HGNC symbol` %in% `HGNC symbol`[duplicated(`HGNC symbol`)]) %>%
  dplyr::arrange(`HGNC symbol`)
## # A tibble: 7,268 x 3
##    `Gene stable ID` `HGNC symbol` `Gene name`                                
##    <chr>            <chr>         <chr>                                      
##  1 ENSG00000261846  AADACL2       arylacetamide deacetylase like 2           
##  2 ENSG00000197953  AADACL2       arylacetamide deacetylase like 2           
##  3 ENSG00000262466  AADACL2-AS1   <NA>                                       
##  4 ENSG00000242908  AADACL2-AS1   <NA>                                       
##  5 ENSG00000276072  AATF          apoptosis antagonizing transcription factor
##  6 ENSG00000275700  AATF          apoptosis antagonizing transcription factor
##  7 ENSG00000281173  ABCB10P1      <NA>                                       
##  8 ENSG00000280461  ABCB10P1      <NA>                                       
##  9 ENSG00000274099  ABCB10P1      <NA>                                       
## 10 ENSG00000282479  ABCB10P3      <NA>                                       
## # ... with 7,258 more rows

There are over 7,000 Ensembl IDs that map to the same gene symbol as another Ensembl ID. That is more than 10% of all Ensembl IDs. So now let’s create a new gene_stats tibble with one of these gene symbols and join with the map to see what happens:

gene_map <- readr::read_tsv("mart_export.tsv")
gene_stats <- tribble(
    ~gene, ~test1_stat, ~test1_p, ~test2_stat, ~test2_p,
   "APOE",   12.509293,   0.1032,   34.239521,   1.3e-5,
  "HOXD1",    4.399211,   0.6323,   16.332318,   0.0421,
   "SNCA",   45.748431,   4.2e-9,    0.757188,   0.9146,
  "DEAF1",    0.000000,      1.0,           0,      1.0
) %>% left_join(gene_map, by=c("gene" = "HGNC symbol"))
gene_stats
## # A tibble: 5 x 7
##   gene  test1_stat      test1_p test2_stat  test2_p `Gene stable ID` `Gene name`
##   <chr>      <dbl>        <dbl>      <dbl>    <dbl> <chr>            <chr>      
## 1 APOE       12.5  0.103            34.2   0.000013 ENSG00000130203  apolipopro~
## 2 HOXD1       4.40 0.632            16.3   0.0421   ENSG00000128645  homeobox D1
## 3 SNCA       45.7  0.0000000042      0.757 0.915    ENSG00000145335  synuclein ~
## 4 DEAF1       0    1                 0     1        ENSG00000282712  DEAF1 tran~
## 5 DEAF1       0    1                 0     1        ENSG00000177030  DEAF1 tran~

Notice how there are two rows for the DEAF1 gene that have identical values except for the Stable Gene ID column. This is a very common problem when mapping gene symbols to other gene identifiers and there is no general solution to picking the “best” mapping, short of manually inspecting all of the duplicates and choosing which one is the most appropriate yourself (which obviously is a huge amount of work). However, we do desire to remove the duplicated rows. In this case, since all the values besides the Ensembl ID are the same, effectively it doesn’t matter which duplicate rows we eliminate. We can do this using the duplicated() function, which returns TRUE for all but the first row of a set of duplicated values:

filter(gene_stats, !duplicated(gene))
## # A tibble: 4 x 7
##   gene  test1_stat      test1_p test2_stat  test2_p `Gene stable ID` `Gene name`
##   <chr>      <dbl>        <dbl>      <dbl>    <dbl> <chr>            <chr>      
## 1 APOE       12.5  0.103            34.2   0.000013 ENSG00000130203  apolipopro~
## 2 HOXD1       4.40 0.632            16.3   0.0421   ENSG00000128645  homeobox D1
## 3 SNCA       45.7  0.0000000042      0.757 0.915    ENSG00000145335  synuclein ~
## 4 DEAF1       0    1                 0     1        ENSG00000282712  DEAF1 tran~

However, in general, you must be careful about identifying these types of one-to-many mapping issues and also about how you mitigate them.