Ordering Objects - Life Out of Sequence


data more amenable to integration and federation with other data types.

The NCBI data model was the work of James Ostell, who had been asked by Lipman to join NCBI as its chief of information engineering in 1988.[79] Ostell needed to solve two problems. The first was how to make data available to the widest possible number of biological users by ensuring that they could be shared across different computer platforms. Ostell's solution was to adopt an international standard (ISO 8824 and ISO 8825) called ASN.1 (Abstract Syntax Notation One). Like the hypertext transfer protocol (HTTP) used on the Internet, ASN.1 is a way for computers to communicate with one another: it specifies rules for describing data objects and the relationships between them. Unlike HTTP, however, it is not text-based, but renders data into binary code. ASN.1 was developed in 1984 for the purpose of structuring email messages; it describes in bits and bytes the layout of messages as they are transmitted between programs or between different computers. ASN.1 acts as a universal grammar that is completely independent of any particular machine architecture or programming language.[80] Ostell chose ASN.1 because “we did not want to tie our data to a particular database technology or a particular programming language.”[81] Using ASN.1 meant that biologists using any programming language or computer system could use the GenBank database.
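To make the idea of rendering data "in bits and bytes" concrete, here is a minimal sketch of the tag-length-value layout defined by the Basic Encoding Rules (the encoding standardized in ISO 8825). It is an illustration only: the helper names and the two-field record (an integer identifier and a short sequence string) are assumptions made for this example, not NCBI's actual data model or toolkit code.

```python
# Illustrative sketch of BER-style tag-length-value (TLV) encoding.
# Helper names and the example record are hypothetical.

TAG_INTEGER = 0x02         # universal tag for INTEGER
TAG_VISIBLE_STRING = 0x1A  # universal tag for VisibleString
TAG_SEQUENCE = 0x30        # universal tag for SEQUENCE (constructed)


def ber_length(n: int) -> bytes:
    """Encode a content length in BER definite form."""
    if n < 0x80:
        return bytes([n])  # short form: a single length byte
    size = (n.bit_length() + 7) // 8
    # long form: first byte says how many length bytes follow
    return bytes([0x80 | size]) + n.to_bytes(size, "big")


def ber_tlv(tag: int, content: bytes) -> bytes:
    """Join a tag byte, the encoded length, and the content bytes."""
    return bytes([tag]) + ber_length(len(content)) + content


def ber_integer(value: int) -> bytes:
    """Encode an integer as the shortest big-endian two's-complement value."""
    size = 1
    while True:
        try:
            content = value.to_bytes(size, "big", signed=True)
            break
        except OverflowError:
            size += 1  # value does not fit yet; try one more byte
    return ber_tlv(TAG_INTEGER, content)


def ber_visible_string(text: str) -> bytes:
    """Encode an ASCII string as a VisibleString."""
    return ber_tlv(TAG_VISIBLE_STRING, text.encode("ascii"))


def ber_sequence(*members: bytes) -> bytes:
    """Wrap already-encoded members in a SEQUENCE."""
    return ber_tlv(TAG_SEQUENCE, b"".join(members))


# Hypothetical record: an identifier and a three-letter sequence string.
record = ber_sequence(ber_integer(5), ber_visible_string("ATG"))
print(record.hex())  # 30080201051a03415447
```

Because the byte order and field boundaries are fixed by the encoding rules rather than by the machine that wrote them, any program on any platform that follows the same rules can read the record back, which is the portability the passage describes.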

The second problem was to find a way of storing various kinds of data in a form that was suited to the needs of biologists who wanted not just DNA sequence information, but also data about protein sequence, protein structure, and expression, as well as information contained in the published literature. The scale of this problem of “heterogeneous sources” had become such that relational databases were no longer appropriate for such linking. “It is clear that the cost of having to stay current on the details of a large number of relational schemas makes this approach impractical,” Ostell argued. “It requires a many-to-many mapping among databases, with all the frailties of that approach.”[82] In other words, keeping the structure of each database consistent with the structure of a large number of others would quickly prove an impossible task. The alternative was to find a way to link the databases using ASN.1 via what Ostell called a “loose federation.” The first such application, which became known as Entrez, used ASN.1 to link nucleic acid databases, protein sequence databases, and a large database of biomedical literature (MEDLINE). Wherever an article was cited in a sequence database (for instance, the publication from which the sequence was taken), the NCBI created a link to the relevant article in MEDLINE using the MEDLINE ID (figure 5.2). Likewise, NCBI created links
