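The tidyverse packages are designed to operate on so-called “tidy data.” According to the tidy data section of the R for Data Science book, a dataset is tidy when it follows these rules:
1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.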
Here, a variable is a quantity or property that every observation in our dataset
has, and each observation is a separate instance of those variables (e.g. a
different sample, subject, etc.). In our gene_stats example tibble above, the
columns holding the different test statistics are the variables, each gene
in each row is an “observation,” and each cell holds a single value; we can
therefore say that the dataset is tidy:
gene_stats
## # A tibble: 3 x 5
##   gene  test1_stat      test1_p test2_stat  test2_p
##   <chr>      <dbl>        <dbl>      <dbl>    <dbl>
## 1 apoe       12.5  0.103            34.2   0.000013
## 2 hoxd1       4.40 0.632            16.3   0.0421
## 3 snca       45.7  0.0000000042      0.757 0.915
Each row being an “observation” is somewhat abstract in this case; we could say that we “observed” the same test for all the genes in the tibble. Depending on what dataset we are working with, we sometimes have to be flexible in our conceptualization of what constitutes variables and observations.
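As a minimal sketch of why this layout is convenient: because each variable sits in its own column, a single mutate() call can derive a new variable for every gene at once. The tibble is rebuilt here from the printed values above so the example runs on its own; the 0.05 cutoff and the test1_sig and test2_sig column names are assumptions made for illustration only.

```r
library(tibble)
library(dplyr)

# Rebuild the example tibble; values are copied from the printed output above
gene_stats <- tribble(
  ~gene,   ~test1_stat, ~test1_p, ~test2_stat, ~test2_p,
  "apoe",  12.5,        0.103,    34.2,        0.000013,
  "hoxd1", 4.4,         0.632,    16.3,        0.0421,
  "snca",  45.7,        4.2e-9,   0.757,       0.915
)

# One column per variable: each new column is computed for all genes (rows)
# in one vectorized step. The 0.05 threshold is a hypothetical choice.
gene_stats_flagged <- mutate(
  gene_stats,
  test1_sig = test1_p < 0.05,
  test2_sig = test2_p < 0.05
)

gene_stats_flagged
```

The result keeps one row per gene and simply gains two logical columns, so it is still tidy.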
The R for Data Science book depicts these rules in the following illustration:
These rules are pretty generic, and in general a dataset might require some work to manipulate it into tidy form. Fortunately for those of us working in biology and bioinformatics, many of our datasets are already provided in a format that is very close to being tidy, or the tools we use to process biological data tidy it for us. For this reason, details about the tidying operations that might be needed for data in the general case are left for reading in the tidy data section of the R for Data Science book.
There is one very big and common exception to the claim above that biological
data is usually already tidy. Briefly, from the illustration above, tidy
data has observations as rows and variables as columns. However, the datasets
that we often use, e.g. gene expression matrices, are organized to have
variables (e.g. genes) as rows and observations (e.g. samples) as columns.
This means some operations that summarize variables across observations, which
are very common to compute, are not easily done with tidyverse functions like
mutate(). We describe how to work around this difference in the section
Biological data is NOT Tidy!.
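To make the mismatch concrete, here is a small sketch using a hypothetical expression tibble (the sample names and values are invented for illustration). With genes as rows and samples as columns, computing a per-gene mean means combining values across columns. One possible approach, not necessarily the one taken in the section referenced above, is to reshape the data with tidyr's pivot_longer() so that each row is a single gene and sample pair, and then summarize by gene.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Hypothetical expression data: genes as rows, samples as columns
expr <- tribble(
  ~gene,   ~sample1, ~sample2, ~sample3,
  "apoe",  10,       20,       30,
  "hoxd1", 1,        2,        3,
  "snca",  5,        5,        5
)

# Reshape to long form: one row per gene/sample combination
expr_long <- pivot_longer(
  expr,
  cols = -gene,
  names_to = "sample",
  values_to = "expression"
)

# In long form, a per-gene summary is an ordinary grouped operation
gene_means <- expr_long %>%
  group_by(gene) %>%
  summarize(mean_expression = mean(expression))

gene_means
```

The long form has nine rows here (three genes by three samples), which illustrates the trade-off: the data become easy to summarize with grouped operations, but much longer than the original matrix-like layout.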