Following Data - Life Out of Sequence

ing began. As DNA sequencing became ubiquitous, the genetic code was supplemented by other letters, such as R (G or A), B (G, T, or C) or N (A, G, T, or C). Such tolerance for ambiguity reflected the fact that sequencing was still sufficiently expensive that some information was better than none. Such a code is already a significant abstraction from a “physical” DNA molecule, which might contain nonstandard nucleotides, epigenetic markers, three-dimensional conformations, or other chemical variations. 9
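The ambiguity letters described above can be read as sets of possible bases. The following is a minimal sketch of that idea, restricted to the three codes the text names (R, B, and N) plus the four unambiguous bases; the table name `AMBIGUITY` and the helper functions are hypothetical names chosen for illustration, and the full ambiguity alphabet contains further letters not shown here.

```python
# Each letter maps to the set of bases it may stand for.
# Only the codes mentioned in the text are included (an assumption
# made for brevity; the complete alphabet has more letters).
AMBIGUITY = {
    "A": {"A"},
    "C": {"C"},
    "G": {"G"},
    "T": {"T"},
    "R": {"A", "G"},            # G or A
    "B": {"C", "G", "T"},       # G, T, or C
    "N": {"A", "C", "G", "T"},  # any base
}


def compatible(code, base):
    """Return True if `base` is one of the bases that `code` can stand for."""
    return base in AMBIGUITY[code]


def matches(pattern, sequence):
    """Return True if every position of `sequence` is allowed by `pattern`."""
    if len(pattern) != len(sequence):
        return False
    return all(compatible(p, s) for p, s in zip(pattern, sequence))
```

Under this reading, a partially resolved read such as `ARN` is consistent with `AGT` but not with `ACT`, which is the sense in which "some information was better than none."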

Despite the standard AGTC shorthand, variation in coding was still problematic for bioinformaticians. Different software programs, for example, required sequences to be submitted in different formats. By the late 1980s, however, the superior speed of Bill Pearson and David Lipman's FASTA program meant that its input format gradually became a standard for the encoding of sequences. Even though FASTA was superseded by BLAST, the FASTA format persisted. 10 The format consists of a first line that begins with a “>” on which information identifying the sequence is listed, followed by the sequence itself on subsequent lines. The appearance of another “>” in the file indicates the beginning of a new sequence. Although FASTA has been criticized because the identification line lacks a detailed structure, it remains the de facto standard for sharing and transferring sequence data. Its simplicity makes it particularly attractive to programmers who want to be able to parse sequence data into their programs quickly and easily. It is a simple but powerful data structure to which sequences must conform.
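To show how little code the format demands of a programmer, here is a minimal sketch of a parser that follows only the rules stated above: a line beginning with “>” identifies a sequence, the following lines hold the sequence, and another “>” starts a new record. The function name `parse_fasta` and the example records are hypothetical, and the sketch makes no attempt to interpret the identification line, which, as noted, has no detailed structure.

```python
def parse_fasta(text):
    """Return a list of (identification_line, sequence) pairs.

    The identification line is returned as free text, without the
    leading ">"; sequence lines are joined into a single string.
    """
    records = []
    header = None   # identification line of the record being read
    chunks = []     # sequence lines collected for that record
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new ">" closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical two-record file, invented purely for illustration.
example = """>seq1 hypothetical example record
AGTCAGTC
AGTN
>seq2 another hypothetical record
RRBN
"""

for identifier, sequence in parse_fasta(example):
    print(identifier, len(sequence))
```

Everything after the “>” is handed back untouched, so any structure in the identification line is left for the caller to impose. That is both the convenience and the criticism described above.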

The ready existence of a widely agreed-upon code has made the sharing of sequence data relatively straightforward. The sharing of other kinds of biological data, however, has required more elaborate schemes. In particular, it is what are known as “annotation data” that have caused the greatest problems. Annotation data include all the information attached to a sequence that is used to describe it: its name, its function, its origin, publication information, the genes or coding regions it contains, exons and introns, transcription start sites, promoter regions, and so on. The problem is that such data can be stored in a variety of ways; different descriptions in natural language (for example: “Homo sapiens,” “H. sapiens,” “homo sapiens,” “Homo_sapiens”), different coordinate systems, and different definitions of features (for instance, how one defines and delimits a gene) inhibit compatibility and interoperability. There are two kinds of solutions to these problems, one of which I will call “centralized” and the other “democratic.”

The democratic approach is known as a distributed annotation sys-
