ing began. As DNA sequencing became ubiquitous, the genetic code
was supplemented by other letters, such as R (G or A), B (G, T, or C) or
N (A, G, T, or C). Such tolerance for ambiguity reflected the fact that
sequencing was still sufficiently expensive that some information was
better than none. Such a code is already a significant abstraction from
a “physical” DNA molecule, which might contain nonstandard
nucleotides, epigenetic markers, three-dimensional conformations, or
other chemical variations. 9
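The way the extended alphabet tolerates ambiguity can be made concrete with a short Python sketch. The table follows the standard IUPAC nucleotide codes, of which R, B, and N are the ones named above; the function name matches is a hypothetical helper introduced only for illustration.

```python
# Standard IUPAC nucleotide codes: each letter stands for a set of bases.
# R, B and N are the examples given in the text.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def matches(observed: str, candidate: str) -> bool:
    """Hypothetical helper: True if the concrete sequence `candidate`
    is one of the sequences the ambiguous read `observed` could denote."""
    if len(observed) != len(candidate):
        return False
    return all(
        base in IUPAC.get(code, "")
        for code, base in zip(observed.upper(), candidate.upper())
    )
```

Under this sketch the read ARN is compatible with AGT and with AAC, but not with ACT, since R admits only A or G in the second position.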
Despite the standard AGTC shorthand, variation in coding was still
problematic for bioinformaticians. Different software programs, for
example, required sequences to be submitted in different formats. By
the late 1980s, however, the superior speed of Bill Pearson and David
Lipman's FASTA program meant that its input format gradually became
a standard for the encoding of sequences. Even though FASTA was
superseded by BLAST, the FASTA format persisted. 10 The format consists
of a first line that begins with a “>” on which information identifying
the sequence is listed, followed by the sequence itself on subsequent
lines. The appearance of another “>” in the file indicates the beginning
of a new sequence. Although FASTA has been criticized because the
identification line lacks a detailed structure, it remains the de facto
standard for sharing and transferring sequence data. Its simplicity makes
it particularly attractive to programmers who want to be able to parse
sequence data into their programs quickly and easily. It is a simple but
powerful data structure to which sequences must conform.
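How little machinery the format demands can be seen in a minimal Python sketch of a reader, written directly from the description above: a line beginning with “>” opens a record, and every following line up to the next “>” belongs to its sequence. The function name parse_fasta and the sample records are hypothetical; this is an illustration under those assumptions, not the parser used by the FASTA program or any particular library.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identification line, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new ">" closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence lines are concatenated; text before the first ">" is skipped.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record file, used only to exercise the sketch.
example = """>seq1 first hypothetical record
AGTCAGTC
NNRB
>seq2 second hypothetical record
TTGACA
"""
records = list(parse_fasta(example.splitlines()))
```

The sketch returns the identification line as a single unparsed string, which reflects the criticism noted above: the format itself gives that line no internal structure for a program to rely on.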
The ready existence of a widely agreed-upon code has made the
sharing of sequence data relatively straightforward. The sharing of
other kinds of biological data, however, has required more elaborate
schemes. In particular, it is what are known as “annotation data” that
have caused the greatest problems. Annotation data include all the
information attached to a sequence that is used to describe it: its name,
its function, its origin, publication information, the genes or coding
regions it contains, exons and introns, transcription start sites, promoter
regions, and so on. The problem is that such data can be stored in a
variety of ways; different descriptions in natural language (for example:
“Homo sapiens,” “H. sapiens,” “homo sapiens,” “Homo_sapiens”),
different coordinate systems, and different definitions of features (for
instance, how one defines and delimits a gene) inhibit compatibility and
interoperability. There are two kinds of solutions to these problems, one
of which I will call “centralized” and the other “democratic.”
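The natural-language side of the problem can be illustrated with a small Python sketch that reduces the four spellings quoted above to a single form. The function normalize_species and the abbreviation table are hypothetical, and the table is an assumption made for the example: an abbreviated genus such as “H.” cannot be expanded reliably without a curated taxonomy.

```python
import re

# Hypothetical lookup for abbreviated genus names. A real system would need
# a curated taxonomy, because "H." alone is ambiguous across genera.
GENUS_ABBREVIATIONS = {("h.", "sapiens"): "homo"}


def normalize_species(name: str) -> str:
    """Collapse case, underscore and spacing variants to 'Genus species'."""
    tokens = [t for t in re.split(r"[\s_]+", name.strip().lower()) if t]
    if len(tokens) != 2:
        raise ValueError(f"expected a two-part species name, got {name!r}")
    genus, species = tokens
    genus = GENUS_ABBREVIATIONS.get((genus, species), genus)
    return f"{genus.capitalize()} {species}"


variants = ["Homo sapiens", "H. sapiens", "homo sapiens", "Homo_sapiens"]
normalized = {normalize_species(v) for v in variants}
```

Even this toy case needs a lookup table agreed on in advance, which is the kind of shared convention that both kinds of solution set out to provide.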
The democratic approach is known as a distributed annotation sys-