Biomedical Engineering Reference
access time. That is, in a database design that doesn't take likely use patterns into account,
performance suffers. A large amount of processor time will be spent extracting information from the
system as the database program performs joins and other operations. This performance penalty is a
reason for not simply polling application databases for data. It's far better, from a performance
perspective, to move the data into a separate data repository, a second database that is optimized
for the desired searching and analysis.
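As a rough illustration of this point, the following Python sketch uses the standard-library sqlite3 module. The table names (patient, lab_result, warehouse_lab) and all data are hypothetical and do not come from any real application database. The join is performed once, when the repository table is loaded, so that later searches read a flat, indexed table instead of repeating the join.

```python
import sqlite3


def build_repository(conn):
    """Load a flat, indexed repository table from joined application tables.

    The schema and rows are hypothetical, chosen only for illustration.
    """
    conn.executescript("""
        CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE lab_result (
            result_id  INTEGER PRIMARY KEY,
            patient_id INTEGER REFERENCES patient(patient_id),
            test       TEXT,
            value      REAL
        );
        INSERT INTO patient VALUES (1, 'Patient A'), (2, 'Patient B');
        INSERT INTO lab_result VALUES
            (1, 1, 'glucose', 5.4),
            (2, 1, 'sodium', 140.0),
            (3, 2, 'glucose', 6.1);

        -- Pay the cost of the join once, while loading the repository.
        CREATE TABLE warehouse_lab AS
            SELECT p.name AS name, r.test AS test, r.value AS value
            FROM lab_result AS r
            JOIN patient AS p ON p.patient_id = r.patient_id;

        -- Index the column that the analysis is expected to search on.
        CREATE INDEX idx_warehouse_test ON warehouse_lab (test);
    """)
    conn.commit()


def search_by_test(conn, test):
    """Search the repository table; no join is needed at query time."""
    query = "SELECT name, value FROM warehouse_lab WHERE test = ? ORDER BY name"
    return conn.execute(query, (test,)).fetchall()


conn = sqlite3.connect(":memory:")
build_repository(conn)
glucose_rows = search_by_test(conn, "glucose")
```

The trade-off in this sketch is that the repository holds a copy of the data, so it has to be reloaded or refreshed when the application tables change.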
The attraction of the ubiquitous relational model is that it is mature, stable, reliable, well understood,
and well suited for a number of different applications in bioinformatics. The basic concepts involved
with the relational model are easily grasped; data are populated into rows and columns in a table,
and tables are associated with one another by joining fields that match in the two tables. However,
the relational model has several limitations. Because the relational model is based on rows and
columns, it's most efficient working with scalar data such as names, addresses, and laboratory
values. In addition, all relationships between objects must be based on matching data values, as opposed to a
location or placeholder in the database. This limitation often requires the database designer to
create additional relations to describe logical associations between data elements. For example, in a
relational database containing both nucleotide and amino acid sequences, the researcher can't relate
the two without the aid of tables that relate nucleotide sequences to proteins and protein sequences
to specific amino acids.
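The need for an additional relation can be sketched in Python with the standard-library sqlite3 module. The table and column names (nucleotide_seq, protein, encodes) and the toy sequences are assumptions made for illustration only. The encodes table carries no biological data of its own; it exists solely to associate rows in the other two tables by matching key values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nucleotide_seq (nuc_id INTEGER PRIMARY KEY, bases TEXT);
    CREATE TABLE protein (protein_id INTEGER PRIMARY KEY, name TEXT, residues TEXT);

    -- The additional relation: it only links the two tables by key value.
    CREATE TABLE encodes (
        nuc_id     INTEGER REFERENCES nucleotide_seq(nuc_id),
        protein_id INTEGER REFERENCES protein(protein_id)
    );

    -- Toy data: ATG GCC TAA translates to Met-Ala followed by a stop codon.
    INSERT INTO nucleotide_seq VALUES (1, 'ATGGCCTAA');
    INSERT INTO protein VALUES (10, 'toy peptide', 'MA');
    INSERT INTO encodes VALUES (1, 10);
""")


def proteins_for(conn, nuc_id):
    """Return (name, residues) for every protein linked to a nucleotide sequence."""
    query = """
        SELECT p.name, p.residues
        FROM encodes AS e
        JOIN protein AS p ON p.protein_id = e.protein_id
        WHERE e.nuc_id = ?
    """
    return conn.execute(query, (nuc_id,)).fetchall()


linked = proteins_for(conn, 1)
```

Without the encodes table, nothing in this schema would connect the nucleotide sequence to the protein, even though both rows are present.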
An even greater limitation of the relational model from a bioinformatics perspective is that the
metaphor of rows and columns often isn't a natural fit for sequence or protein shape data. Recall that
one reason for using a DBMS is to allow users to think of data management in abstract, high-level
terms, instead of the underlying algorithms and data representation schemes. Although tables of
rows and columns can be considered a simplification over hard disk platters, they can seem opaque to
a researcher working with thousands of sequences, genes, and other data that don't fit neatly into a
tabular metaphor. That is, the relational model often doesn't hide the complexity of genomic data. As
a result, various other data models are used by professionals in the biotech industry.
One alternative to the relational model is the hierarchical model, which predates the relational model
by a decade. Unlike the flexible relational model, the hierarchical model defines permanent connections
when the database is created. Within the hierarchical database model, the smallest data entity is the
record. That is, unlike records in a relational model, records within a hierarchical database are not
necessarily broken up into fields. In addition, connections within the hierarchical model don't depend
on the data. The hierarchical links, sometimes called the structure of the data, can best be thought of
as forming an inverted tree, with the parent file at the top and children files below. The relationship
between parent and children is a one-to-many connection, in that one parent may produce multiple
children.
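The inverted tree described above can be sketched in Python as follows. The Record class and its add_child and depth helpers are hypothetical names used for illustration and are not part of any real hierarchical DBMS. Each record is treated as an opaque unit rather than a set of fields, and the parent-to-child links are stored in the structure itself instead of being derived from matching data values.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One record in a hierarchy: an opaque payload plus links to its children."""
    payload: str                                   # not broken up into fields
    children: list = field(default_factory=list)   # one parent, many children

    def add_child(self, payload):
        """Attach a new child record and return it."""
        child = Record(payload)
        self.children.append(child)
        return child


def depth(record):
    """Number of levels in the tree rooted at this record."""
    return 1 + max((depth(child) for child in record.children), default=0)


# The parent file sits at the top, with the children files below it.
root = Record("parent file")
first = root.add_child("child file 1")
second = root.add_child("child file 2")
first.add_child("grandchild file 1a")
```

In this sketch a child is reachable only through its parent, which mirrors the one-to-many relationship between parent and children.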
The basic operation on the hierarchical database is the tree walk, proceeding from parent to child.
Data can be retrieved only by traversing the levels of the hierarchy according to the path defined by
the succession of parent fields. This unidirectional convention causes certain relationships to be
difficult to extract from the database, even though they may be explicit in the data. One
consequence of the hierarchical model is that information must often be repeated. Returning to the
author-subject database example, under the topic of neurofibromatosis, if an author wrote more than
one paper on the subject, the author's name and contact information would be repeated throughout
the database.
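The tree walk and the repetition it leads to can be sketched in Python. The Node class, the tree_walk function, and the paper and author records below are hypothetical, invented for illustration. Because each paper owns its own author record, an author who wrote two papers on the topic is stored twice, and finding every paper by that author means walking the tree downward from the topic.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A record in the hierarchy, linked only downward to its children."""
    record: str
    children: list = field(default_factory=list)


def tree_walk(node, path=()):
    """Yield the path from the root to each record, parent before children."""
    here = path + (node.record,)
    yield here
    for child in node.children:
        yield from tree_walk(child, here)


# Hypothetical hierarchy: topic -> paper -> author contact record.
topic = Node("neurofibromatosis", [
    Node("paper 1", [Node("Author A <a@example.org>")]),
    Node("paper 2", [Node("Author A <a@example.org>")]),
    Node("paper 3", [Node("Author B <b@example.org>")]),
])

paths = list(tree_walk(topic))

# The same contact record appears once per paper the author wrote.
author_copies = sum(1 for p in paths if p[-1].startswith("Author A"))

# Papers by Author A can only be found by walking down from the topic.
papers_by_a = [p[1] for p in paths if len(p) == 3 and p[2].startswith("Author A")]
```

In a relational design the author would be stored once and referenced by key; here the duplication is a direct result of the fixed parent-to-child path.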
The hierarchical model was once very popular in medicine, in the form of the Massachusetts General
Hospital Utility Multi-Programming System (MUMPS) database language, which was used to develop
one of the first electronic medical record (EMR) systems. A reason for the initial popularity of MUMPS
in the late 1960s was that its data model is a good fit for clinical data, which tend to follow a
standard, hierarchical topic outline. For example, patients at the top of the hierarchy have
child nodes containing the elements of the EMR, including chief complaint, diagnosis, and laboratory
results, as defined in Table 2-2. The limitation, noted earlier, is that for every patient admission,
certain data must be repeated, such as the patient's address, billing information, and other demographic
information.
The hierarchical model remains significant in bioinformatics if only because a library of clinical