styler
packagescale()
Sometimes the shape and format of our data is not the most convenient for
performing certain operations on it, even if it is tidy. Let’s say
we are considering the range of statistics that were computed for all of our
genes in the gene_stats
tibble, and wanted to compute the average statistic
over all genes for both tests. Recall our tibble has separate columns for each
test:
<- tribble(
gene_stats ~gene, ~test1_stat, ~test1_p, ~test2_stat, ~test2_p,
"APOE", 12.509293, 0.1032, 34.239521, 1.3e-5,
"HOXD1", 4.399211, 0.6323, 16.332318, 0.0421,
"SNCA", 45.748431, 4.2e-9, 0.757188, 0.9146,
) gene_stats
## # A tibble: 3 x 5
## gene test1_stat test1_p test2_stat test2_p
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 APOE 12.5 0.103 34.2 0.000013
## 2 HOXD1 4.40 0.632 16.3 0.0421
## 3 SNCA 45.7 0.0000000042 0.757 0.915
For convenience, we desire our output to be in table form, with one row per test and the statistics for each test as columns. We could do this manually like so:
tribble(
~test_name, ~min, ~mean, ~max,
"test1_stat", min(gene_stats$test1_stat), mean(gene_stats$test1_stat), max(gene_stats$test1_stat),
"test2_stat", min(gene_stats$test2_stat), mean(gene_stats$test2_stat), max(gene_stats$test2_stat),
)
## # A tibble: 2 x 4
## test_name min mean max
## <chr> <dbl> <dbl> <dbl>
## 1 test1_stat 4.40 20.9 45.7
## 2 test2_stat 0.757 17.1 34.2
This gets the job done, but is clearly very ugly, error prone, and would require significant work if we later added more statistics columns.
Instead of typing out the values we desire manually, we can pivot our tibble
using the tidyr::pivot_longer()
function, so that the column values are placed
in a new column and the corresponding values are placed in yet another column.
This process is illustrated in the following figure:
In the figure, the original tibble has genes along rows and samples as columns. When the sample columns are pivoted, the value of each column name is placed in a new column named “Sample” and repeated for as many rows there are in the tibble. A second new column “Value” is populated with the corresponding values that were in each column. The gene associated with each value is preserved and repeated vertically until all the table columns and values have been pivoted. This process of pivoting transforms the tibble into so called “long” form.
Returning to our gene_stats
example, we can apply some operations to the
tibble to easily perform the summarization we did above in a much more
expressive manner:
<- tidyr::pivot_longer(
long_gene_stats
gene_stats,ends_with("_stat"),
names_to="test",
values_to="stat"
) long_gene_stats
## # A tibble: 6 x 5
## gene test1_p test2_p test stat
## <chr> <dbl> <dbl> <chr> <dbl>
## 1 APOE 0.103 0.000013 test1_stat 12.5
## 2 APOE 0.103 0.000013 test2_stat 34.2
## 3 HOXD1 0.632 0.0421 test1_stat 4.40
## 4 HOXD1 0.632 0.0421 test2_stat 16.3
## 5 SNCA 0.0000000042 0.915 test1_stat 45.7
## 6 SNCA 0.0000000042 0.915 test2_stat 0.757
We see that now instead of having X_stat
columns, the column names and their
values have been put into the test
and stat
columns, respectively. Now to
summarize the statistics for each test, we simply do a group_by()
on the
test
column and compute summaries on the stat
column:
%>%
long_gene_stats ::group_by(test) %>%
dplyr::summarize(min = min(stat), mean = mean(stat), max = max(stat)) dplyr
## # A tibble: 2 x 4
## test min mean max
## <chr> <dbl> <dbl> <dbl>
## 1 test1_stat 4.40 20.9 45.7
## 2 test2_stat 0.757 17.1 34.2
You may verify that the numbers are identical in this pivoted tibble as the one
we manually created earlier. This pivoting method will produce the desired
output regardless of the number of tests we include the table, so long as the
column names end in "_test"
.
The inverse of pivot_longer()
is
pivot_wider()
. If
you have variables gathered in single columns like that produced by
pivot_longer()
you can reverse the process with this function to create a
tibble with those variables as columns.