formation from the published literature (it was often necessary to read
an entire article or even several articles) and keeping it up to date was a
gigantic task. But biologists often wanted to use the database to retrieve
and aggregate information located across many entries. For instance, a
biologist might want to find all the protein-coding sequences in the
database that contained exons with a size greater than 100 kilobases. An
excerpt from a long list of criticisms of GenBank reads:
The BB&N [GenBank] retrieval system is not suited to this
scientific area. Modern systems permit the user to construct
current lists of entries retrieved on various criteria and to perform
manipulations on these sequences. The organization of
the BB&N system is archaic, because it does not readily permit
these manipulations. 52
The flat file and features table were not well adapted to sophisticated
cross-entry queries. Moreover, as biologists produced more and more
sequence, it was inevitable that sequences began to overlap; in order for
this work to be useful, the database had to identify such overlaps and
organize the data in a way that represented these fragments. Another
user wrote to Los Alamos complaining that the flat-file data format was
not always consistent enough to be computer readable and suggesting
“a language for reliably referring to sections of other entries in the
database. If this language is sufficiently powerful, many of the synthetic
sequences could be expressed in this form.” 53 In other words, the user
wanted the database to be organized so as to allow the linkages between
different entries and different sequences to be made manifest.
The result of these demands was that GenBank was unable to keep
pace with the publication of sequences, and particularly with the kinds
of annotations that were supposed to appear in the Features table. By
1985, it took an average of ten months for a published sequence to
appear in the database. This was not only an unacceptably long delay
from the point of view of researchers, but also stood in breach of
GenBank's contract with the NIH (which required sequences to be available
within three months). A progress report from early 1985 explained the
problem:
Since the inception of GenBank . . . there has been a rapid increase
in both the rate at which sequence data is reported and
in the complexity of related information that needs to be annotated.
As should be expected, many reported sequences repeat,