5.6 pipes

One of the key tidyverse programming patterns is chaining manipulations of tibbles together using the %>% operator. We very often want to perform a series of operations on a data frame: for example, reading in a file, renaming one of the columns, subsetting the rows based on some criteria, and computing summary statistics on the result. We might implement such operations using a variable and repeated assignment:

# data_file.csv has two columns: bad_cOlumn_name and numeric_column
data <- readr::read_csv("data_file.csv")
data <- dplyr::rename(data, "better_column_name"=bad_cOlumn_name)
data <- dplyr::filter(data, better_column_name %in% c("condA","condB"))
data_grouped <- dplyr::group_by(data, better_column_name)
summarized <- dplyr::summarize(data_grouped, mean(numeric_column))

The repeated use of data and the intermediate data_grouped variable are unnecessary if you’re only interested in the summarized result. The code is also harder to read than it needs to be, because every line repeats the name of the variable being modified. Using the %>% operator, we can write the same sequence of operations more concisely:

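library(magrittr)  # provides %>%; attaching dplyr or the tidyverse also works

# store the final result in summarized, as in the step-by-step version above
summarized <- readr::read_csv("data_file.csv") %>%
      dplyr::rename("better_column_name"=bad_cOlumn_name) %>%
      dplyr::filter(better_column_name %in% c("condA","condB")) %>%
      dplyr::group_by(better_column_name) %>%
      dplyr::summarize(mean(numeric_column))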

Note that the function calls in the piped example do not have the data frame passed in explicitly. This is because the %>% operator automatically passes the value of the expression on its left as the first argument of the function call on its right, so x %>% f(y) is equivalent to f(x, y). This convention allows us to write only the parts of the code that express the logic of our analysis, without repeated variable names and assignments that make the code less readable.
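To make the first-argument rule concrete, the following minimal sketch compares a nested, explicit version of a pipeline with its piped equivalent. The tibble toy and its columns condition and value are hypothetical stand-ins for the contents of data_file.csv, and the summary column name mean_value is likewise chosen only for illustration.

```r
library(dplyr)  # attaching dplyr also makes %>% available

# Hypothetical toy data standing in for data_file.csv
toy <- tibble::tibble(
  condition = c("condA", "condA", "condB", "condC"),
  value     = c(1, 3, 10, 100)
)

# Explicit form: each result is passed as the first argument of the next call,
# so the code has to be read from the inside out
nested <- summarize(
  group_by(
    filter(toy, condition %in% c("condA", "condB")),
    condition
  ),
  mean_value = mean(value)
)

# Piped form: %>% supplies the left-hand value as the first argument,
# so the steps read top to bottom in the order they are performed
piped <- toy %>%
  filter(condition %in% c("condA", "condB")) %>%
  group_by(condition) %>%
  summarize(mean_value = mean(value))

# Check that both forms produce the same summary table
all.equal(nested, piped)
```

Both forms perform the same operations; the piped form simply lists the steps in the order they are applied, which is usually easier to follow once a pipeline has more than two or three steps.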