5.8 Grouping Data

Sometimes we are interested in summarizing subsets of our data defined by some grouping variable. Consider the following made-up sample metadata for a set of individuals in an Alzheimer’s disease (AD) study:

metadata <- tribble(
    ~ID, ~condition, ~age_at_death, ~Braak_stage, ~APOE_genotype,
  "A01",        "AD",            78,            5,       "e4/e4",
  "A02",        "AD",            81,            6,       "e3/e4",
  "A03",        "AD",            90,            5,       "e4/e4",
  "A04",   "Control",            80,            1,       "e3/e4",
  "A05",   "Control",            79,            0,       "e3/e3",
  "A06",   "Control",            81,            0,       "e2/e3"
)

This is a typical setup for metadata in these types of experiments. There is a sample ID column which uniquely identifies each subject, a condition variable indicating which group each subject is in, and clinical information like age at death, Braak stage (a measure of Alzheimer’s disease pathology in the brain), and diploid APOE genotype (e2 is associated with reduced risk of AD, e3 is baseline, and e4 confers increased risk).

An important experimental design consideration is to match sample attributes between groups as well as possible to avoid confounding our comparison of interest. In this case, age at death is one such variable that we wish to match between groups. Although these values look pretty well matched between AD and Control groups, it would be better to check explicitly. We can do this using dplyr::group_by() to group the rows together based on condition and dplyr::summarize() to compute the mean age at death for each group:

dplyr::group_by(metadata,
  condition
) %>% dplyr::summarize(mean_age_at_death = mean(age_at_death))
## # A tibble: 2 x 2
##   condition mean_age_at_death
##   <chr>                 <dbl>
## 1 AD                       83
## 2 Control                  80

The dplyr::group_by() function accepts a tibble and the name of a column that contains a categorical variable (i.e. a variable with discrete values like AD and Control) and separates the rows of the tibble into groups according to the distinct values of that column. The dplyr::summarize() function accepts the grouped tibble and creates a new tibble with one row per group, whose columns are computed as functions of the column values within each group.
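To see what grouping does on its own, the following minimal sketch inspects a grouped tibble before summarizing it. It rebuilds only the two columns it needs from the sample metadata above; the names ages and grouped are illustrative and not part of the original example.

```r
library(dplyr)

# Two columns copied from the sample metadata above
ages <- tibble::tibble(
  condition = c("AD", "AD", "AD", "Control", "Control", "Control"),
  age_at_death = c(78, 81, 90, 80, 79, 81)
)

# Grouping does not change the rows; it only records which rows belong together
grouped <- dplyr::group_by(ages, condition)

dplyr::group_vars(grouped)  # name(s) of the grouping column(s)
dplyr::n_groups(grouped)    # number of distinct groups
nrow(grouped)               # the row count is unchanged by grouping

# summarize() then collapses each group into a single row
group_means <- dplyr::summarize(grouped, mean_age_at_death = mean(age_at_death))
group_means
```

The grouped tibble still contains all six rows; it is only when dplyr::summarize() is applied that each group is reduced to one row.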

The grouped means show that the mean age at death does differ between the two groups, but not by much. We can go one step further and compute the standard deviation of age at death in each group, along with the age range spanned by one standard deviation on either side of the mean:

dplyr::group_by(metadata,
  condition
) %>% dplyr::summarize(
  mean_age_at_death = mean(age_at_death),
  sd_age_at_death = sd(age_at_death),
  lower_age = mean_age_at_death - sd_age_at_death,
  upper_age = mean_age_at_death + sd_age_at_death
)
## # A tibble: 2 x 5
##   condition mean_age_at_death sd_age_at_death lower_age upper_age
##   <chr>                 <dbl>           <dbl>     <dbl>     <dbl>
## 1 AD                       83            6.24      76.8      89.2
## 2 Control                  80            1         79        81

Note that summarized variables defined earlier in the call, such as mean_age_at_death, can be used to compute variables defined later in the same call. Now we can see that the age ranges defined by +/- one standard deviation clearly overlap, which gives us more confidence that the average ages at death for AD and Control are not substantially different.

For simplicity, the example above used the range of one standard deviation above and below the arithmetic mean as a rough check. The proper way to assess whether the mean ages of the two groups are significantly different is to use an appropriate statistical test, such as a t-test.
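As a rough illustration of such a test, the sketch below runs a two-sample t-test on the ages from the sample metadata using base R's t.test(). The object names ages and age_test are illustrative, and the choice of the default Welch test (which does not assume equal variances) is an assumption made here, not something prescribed by the text.

```r
# Ages at death copied from the sample metadata above
ages <- data.frame(
  condition = c("AD", "AD", "AD", "Control", "Control", "Control"),
  age_at_death = c(78, 81, 90, 80, 79, 81)
)

# Welch two-sample t-test (the default for t.test()) comparing
# mean age at death between the two conditions
age_test <- t.test(age_at_death ~ condition, data = ages)

age_test$statistic  # t statistic
age_test$p.value    # p-value for the difference in means
```

A p-value above the chosen significance threshold (commonly 0.05) would be consistent with the groups being age-matched, although with only three subjects per group the test has very little power.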

Like other functions in dplyr, dplyr::summarize() has some helper functions that give it additional functionality. One useful helper function is n(), which returns the number of records (rows) within each group. We will add one more column to our summarized sample metadata from above that reports the number of subjects within each condition:

dplyr::group_by(metadata,
  condition
) %>% dplyr::summarize(
  num_subjects = n(),
  mean_age_at_death = mean(age_at_death),
  sd_age_at_death = sd(age_at_death),
  lower_age = mean_age_at_death - sd_age_at_death,
  upper_age = mean_age_at_death + sd_age_at_death
)
## # A tibble: 2 x 6
##   condition num_subjects mean_age_at_death sd_age_at_death lower_age upper_age
##   <chr>            <int>             <dbl>           <dbl>     <dbl>     <dbl>
## 1 AD                   3                83            6.24      76.8      89.2
## 2 Control              3                80            1         79        81

We now have a column with the number of subjects in each condition.

Hadley Wickham is from New Zealand, which uses British, rather than American, English. Therefore, in many places, both spellings are supported in the tidyverse; e.g. both summarise() and summarize() are supported.
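The short sketch below checks that the two spellings behave the same way on the ages from the sample metadata. The names ages, british, and american are illustrative only.

```r
library(dplyr)

# Two columns copied from the sample metadata above
ages <- tibble::tibble(
  condition = c("AD", "AD", "AD", "Control", "Control", "Control"),
  age_at_death = c(78, 81, 90, 80, 79, 81)
)

# Same summary, written with the British and the American spelling
british <- dplyr::group_by(ages, condition) %>%
  dplyr::summarise(mean_age_at_death = mean(age_at_death))
american <- dplyr::group_by(ages, condition) %>%
  dplyr::summarize(mean_age_at_death = mean(age_at_death))

# Compare the two results
all.equal(british, american)
```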