8.3 Bioconductor

Bioconductor is an organized collection of strictly biological analysis methods packages for R. These packages are hosted and maintained outside of CRAN because the maintainers enforce a more rigorous set of coding quality, testing, and documentation standards than is required by normal R packages. These stricter requirements arose from a recognition that software packages are generally only as usable as their documentation allows them to be, and also that many if not most of the users of these packages are not statisticians or experienced computer programmers. Rather, they are people like us: biological analysis practitioners who may or may not have substantial coding experience but must analyze our data nonetheless. The excellent documentation and community support of the bioconductor ecosystem is a primary reason why R is such a popular language in biological analysis.

Bioconductor is divided into roughly two sets of packages: core maintainer packages and user contributed packages. The core maintainer packages are among the most critical, because they define a set of common objects and classes (e.g. the ExpressionSet class in the Biobase package) that all Bioconductor packages know how to work with. This common base provides consistency among all Bioconductor packages thereby streamlining the learning process. User contributed and maintained packages provide specialized functionality for specific types of analysis.

Bioconductor itself must be installed prior to installing other Bioconductor packages. To [install bioconductor], you can run the following:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.14")

Bioconductor undergoes regular releases indicated by a version, e.g. version = "3.14" at the time of writing. Every bioconductor package also has a version, and each of those versions may or may not be compatible with a specific version of Bioconductor. To make matters worse, Bioconductor versions are only compatible with certain versions of R. Bioconductor version 3.14 requires R version 4.1.0, and will not work with older R versions. These versions can cause major version compatibility issues when you are forced to use older versions of R, as may sometimes be the case on managed compute clusters. There is not a good general solution for ensuring all your packages work together, but the general best rule of thumb is to use the most current version of R and all packages at the time when you write your code.

The BiocManager package is the only Bioconductor package installable using install.packages(). After installing the BiocManager package, you may then install bioconductor packages:

# installs the affy bioconductor package for microarray analysis
BiocManager::install("affy")

One key aspect of Bioconductor packages is consistent and helpful documentation; every package page on the Bioconductor.org site lists a block of code that will install the package, e.g. for the affy package. In addition, Biconductor provides three types of documentation:

  • Workflow tutorials on how to perform specific analysis use cases
  • Package vignettes for every package, which provide a worked example of how to use the package that can be adapted by the user to their specific case
  • Detailed, consistently formatted reference documentation that gives precise information on functionality and use of each package

The thorough and useful documentation of packages in Bioconductor is one of the reasons why the package ecosystem, and R, is so widely used in biology and bioinformatics.

The base Bioconductor packages define convenient data structures for storing and analyzing many types of data. Recall earlier in the Types of Biological Data section, we described five types of biological data: primary, processed, analysis, metadata, and annotation data. Bioconductor provides a convenient class for storing many of these different data types together in one place, specifically the SummarizedExperiment class in the package of the same name package. An illustration of what a SummarizedExperiment object stores is below, from the Bioconductor maintainer’s paper in Nature:

SummarizedExperiment Schematic - Huber, et al. 2015. “Orchestrating High-Throughput Genomic Analysis with Bioconductor.” Nature Methods 12 (2): 115–21.

As you can see from the figure, this class stores processed data (assays), metadata (colData and exptData), and annotation data (rowData). The SummarizedExperiment class is used ubiquitously throughout the Bioconductor package ecosystem, as are other core classes some of which we will cover later in this chapter.