High throughput sequencing (HTS) technologies measure and digitize the nucleotide sequences of thousands to billions of DNA molecules simultaneously. In contrast to microarrays, which cannot identify new sequences due to their a priori, probe-based design, HTS instruments can in principle determine the sequence of any DNA molecule in a sample. For this reason, HTS experiments are sometimes called unbiased assays, because the sequences they identify are not predetermined.
The most popular HTS technology at the time of writing is provided by the Illumina biotechnology company. This technology was first developed and commercialized by the company Solexa in the late 1990s and was instrumental in the completion of the human genome. The cost of generating data with this technology has fallen by several orders of magnitude since then, driving exponential growth of biological data and fundamentally changing our understanding of biology and health. As a result, the analysis of these data, rather than the time and cost of generating them, has become the primary bottleneck in biological and biomedical research.
The Illumina HTS technology uses a biotechnological technique called sequencing by synthesis. Briefly, this technique entails biochemically ligating millions to billions of short (~300 nucleotide) DNA molecules to a glass slide called a flow cell, denaturing them to become single stranded, and then synthesizing the complementary strand of each by incorporating fluorescently tagged nucleotides. The tagged nucleotides incorporated at each nucleotide addition are excited with a laser and a high resolution photograph is taken of the flow cell. This process is repeated many times until the DNA molecules on the flow cell have been completely synthesized, and the collective images are computationally analyzed to build up the complete sequence of each molecule.
Other HTS methods have been developed that use different strategies, some of which do not rely on sequencing by synthesis. These alternative technologies differ in the length, number, accuracy, and per-base cost of the DNA molecules they sequence. At the time of writing, the two most common technologies include:
Due to its widespread use, only Illumina sequencing data analysis is covered in this book, though additions in the future may be warranted.
Any DNA molecule may be subjected to sequencing via this method; the meaning of the resulting dataset and the goal of subsequent analysis therefore depend upon the material being studied. Below are some of the most common types of HTS experiments:
Complete coverage of the many different types of sequencing assays is beyond the scope of this book, but analysis techniques used for some of these different types of experiments are covered later in this chapter.
The raw data produced by the sequencing instrument, or sequencer, are the digitized DNA sequences for each of the millions to billions of molecules that were captured by the flow cell. The digitized nucleotide sequence of each molecule is called a read, and therefore every run of a sequencing instrument produces millions to billions of reads. The data are stored in a standardized format called the FASTQ file format. Data for a single read in FASTQ format is shown below:
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Each read has four lines in the FASTQ file. The first, beginning with @, is the sequencing header or name, which provides a unique identifier for the read within the file. The second line is the nucleotide sequence as detected by the sequencer. The third line is a secondary header line that starts with + and may or may not include further information. The last line contains the phred quality scores for the base calls at the corresponding positions of the read. Briefly, each character in the phred quality score line encodes an integer that is proportional to the confidence that the base at the corresponding position of the read was called correctly. This information is used to assess the quality of the data. See this more in-depth explanation of phred scores for more information.
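To make the encoding concrete, the following base R sketch decodes the quality string from the example read above into integer phred scores, assuming the standard Sanger/Illumina 1.8+ convention in which each score is the character's ASCII value minus 33.

# Quality string from the example FASTQ record above
qual <- "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"

# Under the Sanger/Illumina 1.8+ convention: phred score = ASCII value - 33
phred <- utf8ToInt(qual) - 33L

head(phred)   # first few scores; the '!' character encodes a score of 0
mean(phred)   # a crude per-read summary of base call confidence

# phred scores correspond to estimated base call error probabilities:
# P(error) = 10^(-Q/10), so Q = 30 means roughly a 1 in 1000 chance of error
error_prob <- 10^(-phred / 10)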
Since all HTS experiments generate the same type of data (i.e. reads), all data analyses begin with processing of the sequences via bioinformatic algorithms; the choice of algorithm is determined by the experiment type and the biological question being asked. A full treatment of the variety of bioinformatic approaches to processing these data is beyond the scope of this book. We will focus on one analysis approach, alignment, which is commonly employed to produce derived data that are then further processed using R.
It is very common for HTS data, also called short read sequencing data, to be generated from organisms whose genomes have already been sequenced, e.g. human or mouse. The reads in a dataset can therefore be compared with the appropriate genome sequence to aid in interpretation. This comparison amounts to searching the organism’s genome, or the so-called reference genome, for the location or locations from which the molecule captured by each read originated. This search process is called alignment: each short read sequence is aligned against the usually much larger reference genome to identify locations with identical or nearly identical sequence. Said differently, the alignment process attempts to assign each read to one or more locations in the reference genome; only reads that have a sufficiently close match in the reference will be assigned.
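As a toy illustration of the idea only, the following base R sketch searches a short, made-up reference sequence for the exact positions of two hypothetical reads. Real aligners must tolerate mismatches and gaps and must search genome-scale references efficiently, so exact string matching merely stands in for the concept here.

# A made-up 60-base "reference genome" and two hypothetical short "reads"
reference <- "ACGTACGTGGTTCAAAGCAGTATCGATCAAATAGTACGTTACCGGATTTGGGACGTACGT"
reads <- c("TTCAAAGCAGTA", "CCGGATTTGGG")

# Exact string matching stands in for alignment in this sketch
for (r in reads) {
  hit <- gregexpr(r, reference, fixed = TRUE)[[1]]
  cat(r, "aligns at position(s):", hit, "\n")
}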
The alignment process is computationally intensive and requires specialized software to perform efficiently. While it may be technically capable of performing alignment, R is not optimized to process character data in the volume typical of these datasets. Therefore, it is not common to work with short read data (i.e. FASTQ files) directly in R, and other processing steps are required to transform the sequencing data into other forms that are then amenable to analysis in R.
The end result of aligning all the reads in a short read dataset is an assignment of reads to positions across the entire reference genome. The distribution of those read alignment locations across the genome forms the basis of the data used to interpret the dataset. Specifically, each locus, or location spanning one or more bases, has zero or more reads aligned to it. The number of reads aligned to a locus, or the read count, is commonly interpreted as being proportional to the number of those molecules found in the original sample. Therefore, the analysis and interpretation of HTS data commonly involves the analysis of count data, which we discuss in detail in the next section. The following figure illustrates the concepts described in this section.
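To make the notion of per-locus read counts concrete, here is a small, hypothetical base R sketch: given made-up aligned start positions for a handful of reads and made-up locus boundaries, binning the positions by locus yields exactly the kind of count data described above.

# Hypothetical aligned positions for 10 reads, and made-up locus boundaries
aligned_positions <- c(102, 105, 110, 480, 483, 490, 491, 492, 950, 955)
locus_breaks <- c(0, 300, 600, 1000)
locus_names  <- c("locusA", "locusB", "locusC")

# Assign each read to a locus and count reads per locus
locus <- cut(aligned_positions, breaks = locus_breaks, labels = locus_names)
table(locus)
# locusA locusB locusC
#      3      5      2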
As mentioned above, the read counts, i.e. the number of reads that align to each locus of a genome, form a distribution we can use to interpret the data. Count data have two important properties that require special attention:
These two properties have the important consequence that count data are not normally distributed. Unlike with microarray data, common statistical techniques like linear regression are not directly applicable to count data. We must therefore account for this property of the data prior to statistical analysis in one of two ways:
We will discuss these strategies in greater detail in the RNASeq Gene Expression Data section.
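As a rough illustration of why normality-based methods are a poor fit, the following base R sketch simulates count-like data from a Poisson model. This is a simplification (RNASeq counts are typically modeled with more flexible distributions, such as the negative binomial), but it shows that counts are non-negative integers whose variance grows with the mean and whose distribution is skewed at low abundance.

set.seed(1)

# Simulate counts for a low- and a high-abundance locus (Poisson is a
# simplification; real RNASeq counts are usually overdispersed)
low  <- rpois(1000, lambda = 2)
high <- rpois(1000, lambda = 200)

c(mean = mean(low),  var = var(low))    # variance tracks the mean, both small
c(mean = mean(high), var = var(high))   # variance tracks the mean, both large

# Low counts are visibly skewed and discrete, unlike a normal distribution
hist(low, breaks = 0:max(low))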