Databases - Bioinformatics Computing

Biomedical Engineering Reference

access time. That is, in a database design that doesn't take likely use patterns into account,

performance suffers. A large amount of processor time will be spent extracting information from the

system as the database program performs joins and other operations. This performance penalty is a

reason for not simply polling application databases for data. It's far better, from a performance

perspective, to move the data into a separate data repository, a second database that is optimized

for the desired searching and analysis.
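
As a minimal sketch of this idea, the following Python example uses the standard sqlite3 module with an invented schema (the patient, lab_result, and repository tables and their contents are assumptions made for illustration, not part of the text). It compares a query that must join the application tables on every request with the same query against a repository table in which the join was performed once and the result indexed for the expected search. For brevity the repository is a table in the same in-memory database rather than a separate database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized "application" tables (hypothetical schema and data).
    CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE lab_result (
        result_id  INTEGER PRIMARY KEY,
        patient_id INTEGER REFERENCES patient(patient_id),
        test       TEXT,
        value      REAL
    );
    INSERT INTO patient VALUES (1, 'Patient A'), (2, 'Patient B');
    INSERT INTO lab_result VALUES
        (1, 1, 'glucose', 5.4),
        (2, 1, 'sodium', 140.0),
        (3, 2, 'glucose', 6.1);
""")

# Querying the application tables directly repeats the join on every request.
via_join = conn.execute("""
    SELECT p.name, r.test, r.value
    FROM patient AS p
    JOIN lab_result AS r ON r.patient_id = p.patient_id
    WHERE r.test = 'glucose'
    ORDER BY p.name
""").fetchall()

# The repository: the join is done once, and the stored result is indexed
# on the column that the expected searches filter on.
conn.executescript("""
    CREATE TABLE repository AS
        SELECT p.name AS name, r.test AS test, r.value AS value
        FROM patient AS p
        JOIN lab_result AS r ON r.patient_id = p.patient_id;
    CREATE INDEX idx_repository_test ON repository (test);
""")
via_repository = conn.execute(
    "SELECT name, test, value FROM repository "
    "WHERE test = 'glucose' ORDER BY name"
).fetchall()

print(via_join == via_repository)
```

Both queries return the same rows; the difference is that the repository query touches a single indexed table, at the cost of storing a second copy of the data that must be refreshed when the application tables change.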

The attraction of the ubiquitous relational model is that it is mature, stable, reliable, well understood,

and well suited for a number of different applications in bioinformatics. The basic concepts involved

with the relational model are easily grasped; data are populated into rows and columns in a table,

and tables are associated with one another by joining fields that match in the two tables. However,

the relational model has several limitations. Because the relational model is based on rows and

columns, it's most efficient working with scalar data such as names, addresses, and laboratory

values. In addition, all relationships between objects must be based on data values rather than on a

location or placeholder in the database. This limitation often requires the database designer to

create additional relations to describe logical associations between data elements. For example, in a

relational database containing both nucleotide and amino acid sequences, the researcher can't relate

the two without the aid of tables that relate nucleotide sequences to proteins and protein sequences

to specific amino acids.
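
The need for an additional relation can be sketched with Python's standard sqlite3 module. The table names (nucleotide_seq, protein_seq, encodes), labels, and toy sequences below are invented for illustration. The point is only that the association between a nucleotide sequence and its protein lives in a third table of matching key values, and that recovering it requires joining through that table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two tables of scalar data (hypothetical schema and toy sequences).
    CREATE TABLE nucleotide_seq (
        nuc_id INTEGER PRIMARY KEY, label TEXT, bases TEXT
    );
    CREATE TABLE protein_seq (
        prot_id INTEGER PRIMARY KEY, label TEXT, residues TEXT
    );
    -- The additional relation: it stores nothing but pairs of key values
    -- that say which nucleotide sequence corresponds to which protein.
    CREATE TABLE encodes (
        nuc_id  INTEGER REFERENCES nucleotide_seq(nuc_id),
        prot_id INTEGER REFERENCES protein_seq(prot_id),
        PRIMARY KEY (nuc_id, prot_id)
    );
    INSERT INTO nucleotide_seq VALUES
        (1, 'gene_x', 'ATGGCCTGA'),
        (2, 'gene_y', 'ATGAAATAG');
    INSERT INTO protein_seq VALUES
        (10, 'protein_x', 'MA'),
        (20, 'protein_y', 'MK');
    INSERT INTO encodes VALUES (1, 10), (2, 20);
""")

# Relating the two kinds of sequence means joining through the linking table.
rows = conn.execute("""
    SELECT n.label, p.label
    FROM nucleotide_seq AS n
    JOIN encodes AS e ON e.nuc_id = n.nuc_id
    JOIN protein_seq AS p ON p.prot_id = e.prot_id
    ORDER BY n.label
""").fetchall()

for nucleotide_label, protein_label in rows:
    print(nucleotide_label, "->", protein_label)
```

Without the encodes table, nothing in the two sequence tables would indicate which rows belong together.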

An even greater limitation of the relational model from a bioinformatics perspective is that the

metaphor of rows and columns often isn't a natural fit for sequence or protein shape data. Recall that

one reason for using a DBMS is to allow users to think of data management in abstract, high-level

terms, instead of the underlying algorithms and data representation schemes. Although tables of

rows and columns can be considered a simplification over hard disk platters, they can seem opaque to

a researcher working with thousands of sequences, genes, and other data that don't fit neatly into a

tabular metaphor. That is, the relational model often doesn't hide the complexity of genomic data. As

a result, various other data models are used by professionals in the biotech industry.

One alternative to the relational model is the hierarchical model, which predates the relational model

by a decade. Unlike the flexible relational model, the hierarchical model defines permanent connections

when the database is created. Within the hierarchical database model, the smallest data entity is the

record. That is, unlike records in a relational model, records within a hierarchical database are not

necessarily broken up into fields. In addition, connections within the hierarchical model don't depend

on the data. The hierarchical links, sometimes called the structure of the data, can best be thought of

as forming an inverted tree, with the parent file at the top and child files below. The relationship

between parent and children is a one-to-many connection, in that one parent may have multiple

children.
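
The structure described above can be sketched in Python as follows. The Record class and its labels are hypothetical and serve only to illustrate the idea. Each record carries an opaque payload that is not split into fields, and the links between records are fixed references from parent to children rather than matches on data values.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Record:
    """Smallest unit of the hierarchy; the payload is not split into fields."""
    payload: str
    # The links are structural: they point at records, not at data values.
    parent: Optional["Record"] = field(default=None, repr=False, compare=False)
    children: List["Record"] = field(default_factory=list)

    def add_child(self, payload: str) -> "Record":
        """Attach a new child record; one parent may have many children."""
        child = Record(payload, parent=self)
        self.children.append(child)
        return child


# An inverted tree: the parent at the top, the children below.
root = Record("parent record")
first = root.add_child("child record 1")
second = root.add_child("child record 2")
grandchild = first.add_child("grandchild record")

print([child.payload for child in root.children])
```

Each child has exactly one parent, while a parent may hold any number of children, which is the one-to-many connection the text describes.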

The basic operation on the hierarchical database is the tree walk, proceeding from parent to child.

Data can be retrieved only by traversing the levels of the hierarchy according to the path defined by

the succession of parent records. This unidirectional convention causes certain relationships to be

difficult to extract from the database, even though they may be explicit in the data. For example, one

characteristic of the hierarchical model is that information must often be repeated. Returning to the

author-subject database example, under the topic of neurofibromatosis, if an author wrote more than

one paper on the subject, the author's name and contact information would be repeated throughout

the database.
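
A tree walk over a small, invented version of the author-subject hierarchy might look like the following Python sketch. The nested-dictionary layout, the paper titles, and the author details are all assumptions made for illustration. Because the author record hangs beneath each paper, it is stored once per paper, and finding every paper by an author requires walking the entire tree from the top.

```python
# Hypothetical hierarchy: subject -> paper -> author record.
AUTHOR = "author: A. Author, a.author@example.org"

tree = {
    "payload": "subject: neurofibromatosis",
    "children": [
        {
            "payload": "paper: Paper 1",
            "children": [{"payload": AUTHOR, "children": []}],
        },
        {
            "payload": "paper: Paper 2",
            "children": [{"payload": AUTHOR, "children": []}],
        },
    ],
}


def tree_walk(node, path=()):
    """Walk depth-first from parent to child, yielding the path to each record."""
    here = path + (node["payload"],)
    yield here
    for child in node["children"]:
        yield from tree_walk(child, here)


paths = list(tree_walk(tree))

# The same author information is stored once under every paper.
author_copies = [p[-1] for p in paths if p[-1].startswith("author:")]

# Going from author back to papers means scanning every path in the tree.
papers_by_author = [p[-2] for p in paths if p[-1] == AUTHOR]

print(len(author_copies), papers_by_author)
```

The walk only ever moves downward, so the question "which papers did this author write?" has no direct route in the hierarchy even though the answer is present in the data.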

The hierarchical model was once very popular in medicine, in the form of the Massachusetts General

Hospital Utility Multi-Programming System (MUMPS) database language, which was used to develop

one of the first electronic medical record (EMR) systems. A reason for the initial popularity of MUMPS

in the late 1960s was that the data model is a good fit for clinical data, which tends to follow a

standard, hierarchical topic outline. For example, patients at the top of the hierarchy have

child nodes containing the elements of the EMR, including chief complaint, diagnosis, and laboratory

results, as defined in Table 2-2. The limitation, noted earlier, is that for every patient admission,

certain data must be repeated, such as the patient's address, billing information, and other demographic

information.

The hierarchical model remains significant in bioinformatics if only because a library of clinical
