Sometimes we are interested in summarizing subsets of our data defined by some grouping variable. Consider the following made-up sample metadata for a set of individuals in an Alzheimer’s disease (AD) study:
metadata <- tribble(
  ~ID, ~condition, ~age_at_death, ~Braak_stage, ~APOE_genotype,
  "A01", "AD", 78, 5, "e4/e4",
  "A02", "AD", 81, 6, "e3/e4",
  "A03", "AD", 90, 5, "e4/e4",
  "A04", "Control", 80, 1, "e3/e4",
  "A05", "Control", 79, 0, "e3/e3",
  "A06", "Control", 81, 0, "e2/e3"
)
This is a typical setup for metadata in these types of experiments. There is a sample ID column which uniquely identifies each subject, a condition variable indicating which group each subject is in, and clinical information like age at death, Braak stage (a measure of Alzheimer’s disease pathology in the brain), and diploid APOE genotype (e2 is associated with reduced risk of AD, e3 is baseline, and e4 confers increased risk).
An important experimental design consideration is to match sample attributes
between groups as well as possible to avoid confounding our comparison of
interest. In this case, age at death is one such variable that we wish to match
between groups. Although these values look pretty well matched between AD and
Control groups, it would be better to check explicitly. We can do this using
dplyr::group_by() to group the rows together based on condition and
dplyr::summarize() to compute the mean age at death for each group:
dplyr::group_by(metadata, condition) %>%
  dplyr::summarize(mean_age_at_death = mean(age_at_death))
## # A tibble: 2 x 2
## condition mean_age_at_death
## <chr> <dbl>
## 1 AD 83
## 2 Control 80
The dplyr::group_by() function accepts a tibble and the name of a column that
contains a categorical variable (i.e. a variable with discrete values like AD
and Control) and separates the rows in the tibble into groups according to the
distinct values of that column. The dplyr::summarize() function accepts the
grouped tibble and creates a new tibble with one row per group, whose contents
are defined as functions of the column values within each group.
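To make the two steps concrete, the sketch below separates grouping from summarizing. It assumes the tibble and dplyr packages are installed and rebuilds the metadata tibble from above; the names grouped and braak_summary, and the choice to summarize Braak stage by its median, are illustrative and not part of the original example.

```r
library(tibble)
library(dplyr)

metadata <- tribble(
  ~ID, ~condition, ~age_at_death, ~Braak_stage, ~APOE_genotype,
  "A01", "AD", 78, 5, "e4/e4",
  "A02", "AD", 81, 6, "e3/e4",
  "A03", "AD", 90, 5, "e4/e4",
  "A04", "Control", 80, 1, "e3/e4",
  "A05", "Control", 79, 0, "e3/e3",
  "A06", "Control", 81, 0, "e2/e3"
)

# group_by() keeps every row; it only records which column defines the groups
grouped <- dplyr::group_by(metadata, condition)
dplyr::group_vars(grouped) # name(s) of the grouping column(s)
dplyr::n_groups(grouped)   # number of distinct groups

# summarize() collapses each group to a single row
braak_summary <- dplyr::summarize(
  grouped,
  median_Braak_stage = median(Braak_stage)
)
braak_summary
```

The grouped tibble still has six rows, while the summarized tibble has one row for each value of condition.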
From the example above, we see that the mean age at death is indeed different between the two groups, but not by much. We can go one step further and compute the standard deviation and an age range around the mean to further investigate:
dplyr::group_by(metadata, condition) %>%
  dplyr::summarize(
    mean_age_at_death = mean(age_at_death),
    sd_age_at_death = sd(age_at_death),
    lower_age = mean_age_at_death - sd_age_at_death,
    upper_age = mean_age_at_death + sd_age_at_death
  )
## # A tibble: 2 x 5
## condition mean_age_at_death sd_age_at_death lower_age upper_age
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 AD 83 6.24 76.8 89.2
## 2 Control 80 1 79 81
Note that summarized variables defined earlier in the call can be used in the definitions of later ones. Now we can see that the age ranges defined by +/- one standard deviation clearly overlap, which gives us more confidence that the average ages at death for AD and Control are not substantially different.
For simplicity, the example above used +/- one standard deviation around the arithmetic mean to define a likely age range for each group. The proper way to assess whether these distributions are significantly different is to use an appropriate statistical test like a t-test.
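As a minimal sketch of such a test, the base R function t.test() can compare age at death between the two conditions using a formula. This assumes the tibble package is installed and rebuilds the metadata tibble from above; the name age_test is illustrative. By default t.test() performs Welch's two-sample t-test, which does not assume equal variances, and that choice is an assumption here rather than something the text prescribes.

```r
library(tibble)

metadata <- tribble(
  ~ID, ~condition, ~age_at_death, ~Braak_stage, ~APOE_genotype,
  "A01", "AD", 78, 5, "e4/e4",
  "A02", "AD", 81, 6, "e3/e4",
  "A03", "AD", 90, 5, "e4/e4",
  "A04", "Control", 80, 1, "e3/e4",
  "A05", "Control", 79, 0, "e3/e3",
  "A06", "Control", 81, 0, "e2/e3"
)

# Welch two-sample t-test of age at death, split by condition
age_test <- t.test(age_at_death ~ condition, data = metadata)

age_test$statistic # t statistic
age_test$p.value   # two-sided p-value
```

With only three subjects per group the test has very little power, so a large p-value here is weak evidence of matching rather than proof of it.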
Like other functions in dplyr, dplyr::summarize() has some helper functions
that give it additional functionality. One useful helper function is n(),
which is defined as the number of records within each group. We will add one
more column to our summarized sample metadata from above that reports the number
of subjects within each condition:
dplyr::group_by(metadata, condition) %>%
  dplyr::summarize(
    num_subjects = n(),
    mean_age_at_death = mean(age_at_death),
    sd_age_at_death = sd(age_at_death),
    lower_age = mean_age_at_death - sd_age_at_death,
    upper_age = mean_age_at_death + sd_age_at_death
  )
## # A tibble: 2 x 6
## condition num_subjects mean_age_at_death sd_age_at_death lower_age upper_age
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AD 3 83 6.24 76.8 89.2
## 2 Control 3 80 1 79 81
We now have a column with the number of subjects in each condition.
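Other helpers work the same way inside dplyr::summarize(). As a brief sketch, dplyr's n_distinct() counts the unique values of a column within each group; here it is applied to APOE genotype alongside n(). The example assumes the tibble and dplyr packages are installed, and the names genotype_summary and num_genotypes are illustrative.

```r
library(tibble)
library(dplyr)

metadata <- tribble(
  ~ID, ~condition, ~age_at_death, ~Braak_stage, ~APOE_genotype,
  "A01", "AD", 78, 5, "e4/e4",
  "A02", "AD", 81, 6, "e3/e4",
  "A03", "AD", 90, 5, "e4/e4",
  "A04", "Control", 80, 1, "e3/e4",
  "A05", "Control", 79, 0, "e3/e3",
  "A06", "Control", 81, 0, "e2/e3"
)

genotype_summary <- dplyr::group_by(metadata, condition) %>%
  dplyr::summarize(
    num_subjects = n(),                       # rows in each group
    num_genotypes = n_distinct(APOE_genotype) # unique genotypes in each group
  )
genotype_summary
```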
Hadley Wickham is from New Zealand, which uses British, rather than American,
English. Therefore, in many places, both spellings are supported in the
tidyverse; e.g. both summarise() and summarize() are supported.
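A quick way to confirm that the two spellings behave the same is to run both on the same grouped tibble and compare the results. This sketch assumes the tibble and dplyr packages are installed; the names grouped, british, and american are illustrative.

```r
library(tibble)
library(dplyr)

metadata <- tribble(
  ~ID, ~condition, ~age_at_death, ~Braak_stage, ~APOE_genotype,
  "A01", "AD", 78, 5, "e4/e4",
  "A02", "AD", 81, 6, "e3/e4",
  "A03", "AD", 90, 5, "e4/e4",
  "A04", "Control", 80, 1, "e3/e4",
  "A05", "Control", 79, 0, "e3/e3",
  "A06", "Control", 81, 0, "e2/e3"
)

grouped <- dplyr::group_by(metadata, condition)

# Same call, two spellings
british <- dplyr::summarise(grouped, mean_age_at_death = mean(age_at_death))
american <- dplyr::summarize(grouped, mean_age_at_death = mean(age_at_death))

# Compare the two result tibbles
isTRUE(all.equal(british, american))
```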