5.5 Tidy Data

The tidyverse packages are designed to operate on so-called “tidy data.” According to the tidy data section of the R for Data Science book, a dataset is tidy when it follows these rules:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

Here, a variable is a quantity or property that is recorded for every observation in our dataset, and each observation is a separate instance of those variables (e.g. a different sample or subject). In our gene_stats example tibble above, the columns holding the different test statistics are the variables, each gene (one per row) is an “observation,” and each cell holds a single value; we can therefore say that the dataset is tidy:

gene_stats
## # A tibble: 3 x 5
##   gene  test1_stat      test1_p test2_stat  test2_p
##   <chr>      <dbl>        <dbl>      <dbl>    <dbl>
## 1 apoe       12.5  0.103            34.2   0.000013
## 2 hoxd1       4.40 0.632            16.3   0.0421  
## 3 snca       45.7  0.0000000042      0.757 0.915

Each row being an “observation” is somewhat abstract in this case; we could say that we “observed” the same tests for all the genes in the tibble. Depending on what dataset we are working with, we sometimes have to be flexible in our conceptualization of what constitutes variables and observations.
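
As a minimal sketch of what this flexibility can look like in practice, the code below rebuilds gene_stats from the values printed above and then views it two ways. The names sig_genes, min_p, and gene_stats_long, as well as the 0.01 cutoff, are illustrative assumptions rather than anything defined elsewhere in this text.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Values copied from the printed gene_stats tibble above
gene_stats <- tribble(
  ~gene,   ~test1_stat, ~test1_p, ~test2_stat, ~test2_p,
  "apoe",  12.5,        0.103,    34.2,        0.000013,
  "hoxd1", 4.40,        0.632,    16.3,        0.0421,
  "snca",  45.7,        4.2e-9,   0.757,       0.915
)

# One row per gene: each variable is a column, so column-wise
# operations such as mutate() and filter() apply directly
sig_genes <- gene_stats %>%
  mutate(min_p = pmin(test1_p, test2_p)) %>%  # smaller p-value per gene
  filter(min_p < 0.01)                        # illustrative cutoff

# Alternative view: one row per (gene, test) pair, with "stat" and "p"
# as the variables; ".value" keeps those suffixes as column names
gene_stats_long <- gene_stats %>%
  pivot_longer(
    cols = -gene,
    names_to = c("test", ".value"),
    names_sep = "_"
  )
```

Both forms can reasonably be called tidy; which one is more convenient depends on whether the unit of interest is the gene or the individual test result.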

The R for Data Science book depicts these rules in the following illustration:

[Figure: Tidy data, from R for Data Science]

These rules are quite generic, and in general a dataset may require some work to manipulate it into tidy form. Fortunately for those of us working in biology and bioinformatics, many of our datasets are already provided in a format that is close to tidy, or the tools we use to process biological data produce tidy output for us. For this reason, the details of the tidying operations that might be needed in the general case are left to the tidy data section of the R for Data Science book.

There is one large and common exception to the claim above that biological data is usually already tidy. Briefly, as the illustration above shows, tidy data has observations as rows and variables as columns. However, datasets that we often use, e.g. gene expression matrices, are organized with variables (e.g. genes) as rows and observations (e.g. samples) as columns. This means that some very common operations that summarize variables across observations are not easily done with tidyverse functions like mutate(), which operate on columns. We describe how to work around this difference in the section Biological data is NOT Tidy!.
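
To make the difficulty concrete, here is a small sketch using a made-up expression table. The tibble expr_mat, its sample column names, and its values are all hypothetical. It shows two possible ways to compute a per-gene mean across samples; neither is necessarily the approach taken in the later section.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Hypothetical expression values: genes as rows, samples as columns
expr_mat <- tribble(
  ~gene,   ~sample1, ~sample2, ~sample3,
  "apoe",  10,       12,       14,
  "hoxd1", 0,        3,        6,
  "snca",  100,      90,       110
)

# mutate() computes on columns, so a mean across the sample columns
# within each row needs extra machinery such as rowwise() + c_across()
gene_means_rowwise <- expr_mat %>%
  rowwise() %>%
  mutate(mean_expr = mean(c_across(starts_with("sample")))) %>%
  ungroup()

# Alternatively, reshape so each (gene, sample) measurement is a row,
# then group by gene and summarize
gene_means_long <- expr_mat %>%
  pivot_longer(-gene, names_to = "sample", values_to = "expression") %>%
  group_by(gene) %>%
  summarize(mean_expr = mean(expression))
```

Either route works for a toy table like this one, but both add steps that a samples-as-rows layout would not need.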