5.9 Rearranging Data

Sometimes the shape and format of our data is not the most convenient for performing certain operations on it, even if it is tidy. Let’s say we are considering the range of statistics that were computed for all of our genes in the gene_stats tibble, and wanted to compute the average statistic over all genes for both tests. Recall our tibble has separate columns for each test:

gene_stats <- tribble(
    ~gene, ~test1_stat, ~test1_p, ~test2_stat, ~test2_p,
   "APOE",   12.509293,   0.1032,   34.239521,   1.3e-5,
  "HOXD1",    4.399211,   0.6323,   16.332318,   0.0421,
   "SNCA",   45.748431,   4.2e-9,    0.757188,   0.9146,
)
gene_stats
## # A tibble: 3 x 5
##   gene  test1_stat      test1_p test2_stat  test2_p
##   <chr>      <dbl>        <dbl>      <dbl>    <dbl>
## 1 APOE       12.5  0.103            34.2   0.000013
## 2 HOXD1       4.40 0.632            16.3   0.0421  
## 3 SNCA       45.7  0.0000000042      0.757 0.915

For convenience, we desire our output to be in table form, with one row per test and the statistics for each test as columns. We could do this manually like so:

tribble(
   ~test_name, ~min, ~mean, ~max,
   "test1_stat", min(gene_stats$test1_stat), mean(gene_stats$test1_stat), max(gene_stats$test1_stat),
   "test2_stat",  min(gene_stats$test2_stat), mean(gene_stats$test2_stat), max(gene_stats$test2_stat),
)
## # A tibble: 2 x 4
##   test_name    min  mean   max
##   <chr>      <dbl> <dbl> <dbl>
## 1 test1_stat 4.40   20.9  45.7
## 2 test2_stat 0.757  17.1  34.2

This gets the job done, but is clearly very ugly, error prone, and would require significant work if we later added more statistics columns.

Instead of typing out the values we desire manually, we can pivot our tibble using the tidyr::pivot_longer() function, so that the column values are placed in a new column and the corresponding values are placed in yet another column. This process is illustrated in the following figure:

Pivot longer moves columns and values to two new columns

In the figure, the original tibble has genes along rows and samples as columns. When the sample columns are pivoted, the value of each column name is placed in a new column named “Sample” and repeated for as many rows there are in the tibble. A second new column “Value” is populated with the corresponding values that were in each column. The gene associated with each value is preserved and repeated vertically until all the table columns and values have been pivoted. This process of pivoting transforms the tibble into so called “long” form.

Returning to our gene_stats example, we can apply some operations to the tibble to easily perform the summarization we did above in a much more expressive manner:

long_gene_stats <- tidyr::pivot_longer(
  gene_stats,
  ends_with("_stat"),
  names_to="test",
  values_to="stat"
)
long_gene_stats
## # A tibble: 6 x 5
##   gene       test1_p  test2_p test         stat
##   <chr>        <dbl>    <dbl> <chr>       <dbl>
## 1 APOE  0.103        0.000013 test1_stat 12.5  
## 2 APOE  0.103        0.000013 test2_stat 34.2  
## 3 HOXD1 0.632        0.0421   test1_stat  4.40 
## 4 HOXD1 0.632        0.0421   test2_stat 16.3  
## 5 SNCA  0.0000000042 0.915    test1_stat 45.7  
## 6 SNCA  0.0000000042 0.915    test2_stat  0.757

We see that now instead of having X_stat columns, the column names and their values have been put into the test and stat columns, respectively. Now to summarize the statistics for each test, we simply do a group_by() on the test column and compute summaries on the stat column:

long_gene_stats %>%
  dplyr::group_by(test) %>%
  dplyr::summarize(min = min(stat), mean = mean(stat), max = max(stat))
## # A tibble: 2 x 4
##   test         min  mean   max
##   <chr>      <dbl> <dbl> <dbl>
## 1 test1_stat 4.40   20.9  45.7
## 2 test2_stat 0.757  17.1  34.2

You may verify that the numbers are identical in this pivoted tibble as the one we manually created earlier. This pivoting method will produce the desired output regardless of the number of tests we include the table, so long as the column names end in "_test".

The inverse of pivot_longer() is pivot_wider(). If you have variables gathered in single columns like that produced by pivot_longer() you can reverse the process with this function to create a tibble with those variables as columns.