Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?
T.S. Eliot, "The Rock"
Getting at the hard-won sequence and structure data in molecular biology databases and the
functional data in the online biomedical literature is complicated by the size and complexity of the
databases. Often, it's assumed, sometimes incorrectly, that certain data are contained in a
database. Even when they are, exhaustively searching for the raw data and then transforming and
manipulating them by hand is often impractical. Similarly, in cases where it
isn't certain what relationships can be garnered from searching through a database, the odds of
finding every biologically relevant relationship through manually authored query statements are low.
When it's known in general what resides in a database and there is a need to extract it, the challenge
is more of a translation problem. Conversely, when very little is known about what resides in the
database, the work is primarily data discovery. In either case, the time and computational resources
required to locate and manipulate the data are limiting factors.
Camouflaged by the size and complexity of a database, the millions of data points from genomic or
proteomic studies are of little value. Only when these data are categorized according to a meaningful
theme are they useful in furthering our understanding of sequence, structure, or function. Regardless
of whether this categorization is at the base pair, chromosome, or gene level, an organizing theme is
critical because it simplifies and reduces the complexity of what could otherwise be a flood of
incomprehensible data. For example, the individual databases managed by the NCBI represent
generally recognizable organizational themes that facilitate use of their contents. At a higher level,
our understanding of health and disease is facilitated by the organization of clinical research data by
organ system, pathogen, genetic aberration, or site of trauma.
Ideally, the creator and the users of the database share an understanding of the underlying
organizational theme. These themes, and the tools used to support them, determine how easily
databases created for one purpose can be used for other purposes. For example, in a relational
database of gene sequences, the data may be arranged in tables, and the user may need to
construct Structured Query Language (SQL) statements to search for and retrieve data. However, if
the relational database is organized by inherited disease, it may not readily support an efficient
search by protein sequence.
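The effect of an organizing theme on retrieval can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table name gene_record, the index idx_disease, the disease labels, and the short sequences are all hypothetical placeholders invented for this illustration, not the schema or contents of any real database. The table is indexed only by disease, so a lookup by disease can use the index, whereas a search by protein sequence fragment has to scan every row.

```python
import sqlite3


def build_disease_db():
    """Create an in-memory table organized (indexed) by inherited disease."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene_record ("
        " id INTEGER PRIMARY KEY,"
        " disease TEXT NOT NULL,"
        " gene_symbol TEXT NOT NULL,"
        " protein_seq TEXT NOT NULL)"
    )
    # The only index reflects the organizing theme: inherited disease.
    conn.execute("CREATE INDEX idx_disease ON gene_record (disease)")
    # Made-up placeholder rows, for illustration only.
    conn.executemany(
        "INSERT INTO gene_record (disease, gene_symbol, protein_seq)"
        " VALUES (?, ?, ?)",
        [
            ("disease_A", "GENE1", "MKTAYIAKQR"),
            ("disease_A", "GENE2", "MSDNELLQWA"),
            ("disease_B", "GENE3", "MKTLLVAGQR"),
        ],
    )
    return conn


def query_plan(conn, sql, params=()):
    """Return SQLite's description of how it would execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row holds the human-readable detail.
    return " | ".join(str(row[-1]) for row in rows)


BY_DISEASE = "SELECT gene_symbol FROM gene_record WHERE disease = ?"
BY_SEQUENCE = "SELECT gene_symbol FROM gene_record WHERE protein_seq LIKE ?"

if __name__ == "__main__":
    db = build_disease_db()
    print(query_plan(db, BY_DISEASE, ("disease_A",)))   # index lookup
    print(query_plan(db, BY_SEQUENCE, ("%KT%",)))       # full table scan
```

The exact wording of the plan text differs between SQLite versions, but the disease query should report a search using idx_disease, while the sequence query should report a scan of the whole table. With three rows the difference is invisible; with millions of records it is the difference between a practical query and an impractical one.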
The challenge for researchers looking in the exponentially increasing quantities of molecular biology data
for assumed and unknown relationships can be formidable, even if the number of data elements and
dimensionality are relatively small. For example, a relational database with a few hundred records
(rows) and a small number of fields per record (low dimensionality) can probably be searched
manually for new interrelationships in the data. However, the task may involve creating relatively
complicated, computationally intensive joins in order to create views that support a given hypothesis
of how data are related. In addition, even within a relatively small database, it may be practically
impossible to specify a relationship query exactly. At issue is how best to support the formulation of a
hypothesis-based query. Moreover, even if technology is available that allows a researcher to
specify any hypothetical query, the potential for discovering new relationships in data is a function of
the insights and biases imposed by the researcher. While these limitations may be problematic in
relatively small databases, they may be intolerable in databases with billions of interrelated data
elements.
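As a rough sketch of what a hypothesis-based query looks like in practice, the following Python example (again using the built-in sqlite3 module) encodes one hypothetical hypothesis, "two genes are related if both are highly expressed in the same tissue," as a self-join wrapped in a view. The table expression, the view coexpressed, the sample rows, and the threshold of 5.0 are all assumptions made for illustration. Note that the researcher has to commit to an exact definition of "highly expressed" before the query can be written at all, which is one form of the bias described above.

```python
import sqlite3


def build_expression_db(rows, threshold=5.0):
    """Load (gene, tissue, level) rows and define a hypothesis view."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE expression (gene TEXT, tissue TEXT, level REAL)"
    )
    conn.executemany("INSERT INTO expression VALUES (?, ?, ?)", rows)
    # Views cannot take parameters, so the (arbitrary) threshold is
    # written into the SQL text after being forced to a float.
    cutoff = float(threshold)
    conn.execute(
        "CREATE VIEW coexpressed AS "
        "SELECT a.gene AS gene_a, b.gene AS gene_b, a.tissue AS tissue "
        "FROM expression AS a JOIN expression AS b "
        "ON a.tissue = b.tissue AND a.gene < b.gene "
        f"WHERE a.level >= {cutoff} AND b.level >= {cutoff}"
    )
    return conn


def coexpressed_pairs(conn):
    """Return gene pairs that satisfy the hypothesis, in a stable order."""
    sql = ("SELECT gene_a, gene_b, tissue FROM coexpressed "
           "ORDER BY gene_a, gene_b, tissue")
    return conn.execute(sql).fetchall()


def candidate_pairs(n):
    """Number of unordered pairs among n elements: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Made-up sample measurements, for illustration only.
SAMPLE_ROWS = [
    ("G1", "liver", 7.2), ("G2", "liver", 6.1), ("G3", "liver", 1.0),
    ("G1", "brain", 0.5), ("G2", "brain", 8.0), ("G3", "brain", 9.3),
]

if __name__ == "__main__":
    db = build_expression_db(SAMPLE_ROWS)
    print(coexpressed_pairs(db))
    print(candidate_pairs(300), candidate_pairs(10**9))
```

The helper candidate_pairs shows why the same approach scales poorly: the number of pairwise comparisons a naive self-join must consider grows with the square of the number of elements, so a query that is manageable for a few hundred records becomes prohibitive for a database with billions of them.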
To work around the computational constraints imposed by these large molecular biology databases,
researchers frequently turn to biological heuristics to avoid exhaustive searches or processes with a
low likelihood of success. For example, in hunting for new genes, a good place to start, from a