2.1 A Brief History of Data in Molecular Biology

Molecular biology became a data science in 1953 when the structure of DNA was determined. Prior to this advance, biochemical assays of biological systems could make general statements about the characteristics and composition of biological macromolecules (e.g., there are two types of nucleic acids: those built on ribose (RNA) and those built on deoxyribose (DNA)) and some quantitative statements about those compositions (e.g., there are roughly equal concentrations of purines (adenine, guanine) and pyrimidines (cytosine, thymine) in any single chromosome). However, the demonstration that each nucleic acid molecule has a specific (and eventually measurable) sequence opened the possibility of defining the genetic signature of every living thing on Earth which, in principle, would enable us to understand how life works from its most basic components. A tantalizing prospect, to say the least.

It is perhaps a happy coincidence that our computational and data storage technologies began developing around the same time these molecular biology advances were being made. While mechanical computers had existed for more than a hundred years and arguably longer, the machine often credited as the first electronic digital computer, the Atanasoff-Berry Computer (ABC), was invented by John Vincent Atanasoff and Clifford Berry in 1942 at what is now Iowa State University. Over the following decades, the speed, sophistication, and reliability of computing machines increased exponentially, enabling ever larger and faster computations to be performed.

The development of computational capabilities necessitated technologies for storing the information these machines used: both the instructions that tell a computer which operations to perform and the data on which to perform them. Until the 1950s, the most commonly available mechanical data storage technologies, such as writing, phonographic cylinders and disks (a.k.a. records), and punch cards, were impractical or unsuitable for creating and storing the amount of data these computers needed. A newer technology, magnetic storage, originally proposed in 1888 by Oberlin Smith, quickly became the standard way digital computers read and stored information.

With these technological advances, the second half of the 20th century saw rapid progress in our ability to determine and study the properties and function of biological molecular sequences, primarily DNA and RNA (although the first biological sequences scientists determined were those of proteins, which are composed of amino acids, using methods independently invented by Frederick Sanger and Pehr Edman).

In 1970, Pauline Hogeweg and Ben Hesper defined the new discipline of bioinformatics as “the study of informatic processes in biotic systems” (in fact, the original term was proposed in Dutch, Hogeweg and Hesper’s native language). This early form of bioinformatics was a subfield of theoretical biology, studied by those who recognized that biological systems, much like our computer systems, can be viewed as information storage and processing systems themselves.

This broad definition of bioinformatics began narrowing in practice to the study of genetic information as the amount of molecular sequence data we collected grew. By the early 1980s, biological sequence data stored on magnetic tape were being created and studied using new pattern recognition algorithms on computers, which were becoming more widely available at academic and research institutions. At the same time, the idea of determining the complete sequence of the human genome was born, leading to the inception of the Human Genome Project and the present post-genomic era.

Biological Data Timeline - Setting the Stage