data more amenable to integration and federation with other data types.
The NCBI data model was the work of James Ostell, who had been
asked by Lipman to join NCBI as its chief of information engineering in
1988. 79 Ostell needed to solve two problems. The first was how to make
data available to the widest possible number of biological users by
ensuring that they could be shared across different computer platforms.
Ostell's solution was to adopt an international standard (ISO 8824 and
ISO 8825) called ASN.1 (Abstract Syntax Notation One). Like the
hypertext transfer protocol (HTTP) used on the Internet, ASN.1 is a way
for computers to communicate with one another: it specifies rules for
describing data objects and the relationships between them. Unlike HTTP,
however, it is not text-based, but renders data into binary code. ASN.1
was developed in 1984 for the purpose of structuring email messages; it
describes in bits and bytes the layout of messages as they are transmit-
ted between programs or between different computers. ASN.1 acts as
a universal grammar that is completely independent of any particular
machine architecture or programming language. 80 Ostell chose ASN.1
because “we did not want to tie our data to a particular database tech-
nology or a particular programming language.” 81 Using ASN.1 meant
that biologists using any programming language or computer system
could use the GenBank database.
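
As a rough illustration of what rendering data "into binary code" means, the Python sketch below hand-encodes a small record using the tag-length-value layout of the Basic Encoding Rules (ISO 8825), one of the encodings defined for ASN.1. The record shape (an identifier string plus a sequence length), the sample values, and all function names are hypothetical and chosen only for illustration; this is not the NCBI data model or NCBI's software.

```python
def encode_length(n: int) -> bytes:
    """BER definite length: one byte below 128, long form otherwise."""
    if n < 0x80:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body


def encode_tlv(tag: int, content: bytes) -> bytes:
    """Every BER value is a tag byte, a length, then the content bytes."""
    return bytes([tag]) + encode_length(len(content)) + content


def encode_integer(value: int) -> bytes:
    """INTEGER (universal tag 0x02); this sketch handles non-negative values only."""
    if value < 0:
        raise ValueError("negative integers are not handled in this sketch")
    # bit_length() // 8 + 1 bytes always leaves the top (sign) bit clear.
    content = value.to_bytes(value.bit_length() // 8 + 1, "big")
    return encode_tlv(0x02, content)


def encode_visible_string(text: str) -> bytes:
    """VisibleString (universal tag 0x1A), restricted here to ASCII text."""
    return encode_tlv(0x1A, text.encode("ascii"))


def encode_sequence(*items: bytes) -> bytes:
    """SEQUENCE (constructed tag 0x30) wrapping already-encoded members."""
    return encode_tlv(0x30, b"".join(items))


# Hypothetical record: a made-up identifier "AB123" and a length of 300.
record = encode_sequence(encode_visible_string("AB123"), encode_integer(300))
print(record.hex())  # 300b1a0541423132330202012c
```

The resulting byte string is the same whatever machine or language produces it, which is the sense in which the encoding is independent of architecture and programming language.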
The second problem was to find a way of storing various kinds of
data in a form that was suited to the needs of biologists who wanted not
just DNA sequence information, but also data about protein sequence,
protein structure, and expression, as well as information contained in
the published literature. The scale of this problem of “heterogeneous
sources” had become such that relational databases were no longer ap-
propriate for such linking. “It is clear that the cost of having to stay
current on the details of a large number of relational schemas makes
this approach impractical,” Ostell argued. “It requires a many-to-many
mapping among databases, with all the frailties of that approach.” 82
In other words, keeping the structure of each database consistent with
the structure of a large number of others would quickly prove an
impossible task. The alternative was to find a way to link the databases
using ASN.1 via what Ostell called a “loose federation.” The first such
application, which became known as Entrez, used ASN.1 to link nucleic
acid databases, protein sequence databases, and a large database of bio-
medical literature (MEDLINE). Wherever an article was cited in a se-
quence database (for instance, the publication from which the sequence
was taken), the NCBI created a link to the relevant article in
MEDLINE using the MEDLINE ID (figure 5.2). Likewise, NCBI created links