2 Introduction
Since the publication of the first draft human genome in 2001, data has driven biological discovery at a rapidly accelerating pace. Rapid innovations in data-generating biochemical instruments, computational resource availability, data storage, and analytical approaches, including artificial intelligence, machine learning, and data science more generally, have combined to advance our understanding of biological systems by orders of magnitude. As the pace of development of these technologies has increased, practitioners of biological inquiry are expected to keep up with the rapidly expanding set of knowledge, skills, and tools required to use them.
Modern biological data analysis entails combining knowledge and skills from many domains, including not only biological subjects like molecular biology, genetics, genomics, and biochemistry, but also computational and quantitative skills such as statistics, mathematics, programming and software engineering, high performance and cloud computing, data visualization, and computer science. No one person can be expert in all of these areas, but modern software tools and packages made available by subject matter experts enable us to perform cutting-edge analyses with only a conceptual understanding of the underlying topics.
One such tool is R, a programming language and environment designed specifically for statistical analysis and data visualization. Today R is one of the two most popular programming languages in biological data analysis and bioinformatics (the other being Python).
A major innovation in the R language came with the introduction of the tidyverse, a set of open-source data manipulation and visualization packages first developed by Hadley Wickham and now improved, supported, and maintained by his team of data scientists and software engineers, along with other contributors. The tidyverse is a collection of packages that specialize in different aspects of data manipulation, with the goal of enabling powerful, consistent, and accurate data operations in the broad field of data science. While not changing the structure of the language per se, the tidyverse packages define a set of consistent programming conventions and patterns tailored to the types of manipulations required to make data “tidy” and, as a result, easier and more consistent to work with. The tidyverse is thus something of its own “language” that is compatible with, but distinct in convention from, the base R language.
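To make the difference in convention concrete, the following minimal sketch performs the same filter-and-summarize operation twice, once with base R and once with the tidyverse package dplyr. The data frame `expr` and its columns `gene`, `condition`, and `value` are hypothetical and invented purely for illustration, and the sketch assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data: one measurement per gene and condition
expr <- data.frame(
  gene = c("A", "B", "C", "A", "B", "C"),
  condition = c("ctrl", "ctrl", "ctrl", "treat", "treat", "treat"),
  value = c(10, 20, 30, 15, 18, 45)
)

# Base R: keep rows with value > 12 using logical indexing,
# then compute the mean value per condition with aggregate()
base_subset <- expr[expr$value > 12, ]
base_means <- aggregate(value ~ condition, data = base_subset, FUN = mean)

# tidyverse (dplyr): the same steps written as a pipeline of verbs
tidy_means <- expr %>%
  filter(value > 12) %>%
  group_by(condition) %>%
  summarise(value = mean(value))

print(base_means)
print(tidy_means)
```

Both approaches should yield the same per-condition means. The tidyverse version reads as a sequence of named steps, which is the kind of consistent pattern the paragraph above describes.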
This book and accompanying course focus on how to use R and its related package ecosystems to analyze, visualize, and communicate biological data analyses. As noted above, effective biological data analysis employs skills that span several knowledge domains. This book covers many of these topics in relatively shallow depth, but with the intent of presenting just enough in each to enable the learner to become proficient in most day-to-day biological analysis tasks.
2.1 Who This Book Is For
This book was written for the practicing biologist wishing to learn how to use R to analyze biological data. A basic working knowledge of genetics, genomics, molecular biology, and biochemistry is assumed, but we endeavored to include enough pertinent background to understand the analysis concepts presented in the text. Basic knowledge of statistics is assumed, but again some background is provided as necessary to understand the analyses and concepts in the text. No further knowledge is assumed.
2.2 A Note About Reinventing the Wheel
Many topics in this book are covered elsewhere in greater detail and depth. The content in each section is intended to stand alone, but it may not match the level of detail found in online materials where others have covered the topic more thoroughly. Where this is the case, the sections link to those resources in case the instructions in this book are too terse or unclear.
2.3 Sources and References
The materials of this book were inspired and informed by a large number of sources, including books and freely available online materials. The authors would like to thank the creators and maintainers of these resources for their generosity in making their valuable contributions:
R Materials
- Hands-On Programming with R, by Garrett Grolemund
- R for Data Science, by Hadley Wickham, Garrett Grolemund, et al.
- Advanced R, by Hadley Wickham
- STAT 545 - Data wrangling, exploration, and analysis with R
- What They Forgot to Teach You About R
- Reproducible Analysis with R, by State of Alaska’s Salmon and People Project, NCEAS
- Data Science for Psychologists, by Hansjörg Neth
Data visualization
- How Charts Lie: Getting Smarter about Visual Information, by Alberto Cairo
- The Functional Art - An Introduction to Information Graphics and Visualization, by Alberto Cairo
- The Truthful Art - Data, Charts, and Maps for Communication, by Alberto Cairo