Databases - Bioinformatics Computing


organized according to the user's likely needs. Like a data repository, a data mart has a narrow focus

on data that is specific to a particular research project or task. That is, a data mart contains a subset

of the data contained in other databases as opposed to an indiscriminate mass copying of all the data

from another database. The major difference between a data mart and a data repository is that a

data mart contains data extracted or mirrored—copied in real time—from multiple application

databases.
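As a rough illustration of how a data mart draws a project-specific subset from several application databases, the following sketch uses Python's built-in `sqlite3` module. The table names (`results`, `mart_records`), the source names, and the sample rows are all hypothetical and are not taken from the text. It performs a one-time extract rather than real-time mirroring.

```python
import sqlite3


def make_source(rows):
    """Create a stand-in application database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE results (sample_id TEXT, gene TEXT, value REAL)")
    conn.executemany("INSERT INTO results VALUES (?, ?, ?)", rows)
    return conn


def build_data_mart(sources, project_gene):
    """Copy only the rows relevant to one project into an in-memory mart."""
    mart = sqlite3.connect(":memory:")
    mart.execute(
        "CREATE TABLE mart_records "
        "(source TEXT, sample_id TEXT, gene TEXT, value REAL)"
    )
    for name, conn in sources.items():
        # Select a narrow subset instead of copying the whole source table.
        rows = conn.execute(
            "SELECT sample_id, gene, value FROM results WHERE gene = ?",
            (project_gene,),
        ).fetchall()
        mart.executemany(
            "INSERT INTO mart_records VALUES (?, ?, ?, ?)",
            [(name, *row) for row in rows],
        )
    mart.commit()
    return mart


# Two illustrative application databases with made-up rows.
sources = {
    "sequencing_db": make_source([("S1", "BRCA1", 0.9), ("S2", "TP53", 0.4)]),
    "clinical_db": make_source([("S1", "BRCA1", 1.2), ("S3", "EGFR", 0.7)]),
}

mart = build_data_mart(sources, "BRCA1")
mart_rows = mart.execute(
    "SELECT source, sample_id, gene, value FROM mart_records ORDER BY source"
).fetchall()
print(mart_rows)
```

Recording the source name on each row is one possible design choice; it keeps the origin of every record traceable after the subsets are combined.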

One step up from the data mart is the data warehouse, a central database, frequently very large,

that can provide authenticated researchers with access to all of an institution's information. A

data warehouse is usually populated with data from a variety of incompatible sources, such as

sequencing machines, clinical systems, or national genomic databases. Because a data warehouse

combines data from a variety of application-oriented databases into a single system, data from

disparate sources must be cleaned, encoded, and translated so that a standard set of analytical tools

can be used with the data. Furthermore, the data in a data warehouse are nonvolatile in that new

data are appended to the database and never replace existing data. In addition, the data warehouse

is considered time-variant in that the data are time-stamped.
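The nonvolatile and time-variant properties can be sketched in a few lines of Python. This is a minimal in-memory model under stated assumptions, not how any particular warehouse product works: the class name `AppendOnlyStore`, its methods, and the sample key are hypothetical.

```python
from datetime import datetime, timezone


class AppendOnlyStore:
    """Toy model of a nonvolatile, time-stamped store (hypothetical API)."""

    def __init__(self):
        self._rows = []  # list of (key, value, timestamp) tuples

    def append(self, key, value, loaded_at=None):
        # New data are appended; existing rows are never updated or deleted.
        stamp = loaded_at or datetime.now(timezone.utc)
        self._rows.append((key, value, stamp))

    def history(self, key):
        """Return every value ever recorded for `key`, oldest first."""
        matches = [(t, v) for k, v, t in self._rows if k == key]
        return [(v, t) for t, v in sorted(matches, key=lambda m: m[0])]

    def as_of(self, key, when):
        """Return the most recent value recorded at or before `when`."""
        candidates = [(t, v) for k, v, t in self._rows if k == key and t <= when]
        if not candidates:
            return None
        return max(candidates, key=lambda c: c[0])[1]


store = AppendOnlyStore()
jan = datetime(2020, 1, 1, tzinfo=timezone.utc)
jun = datetime(2020, 6, 1, tzinfo=timezone.utc)
store.append("patient42.glucose", 5.4, loaded_at=jan)
store.append("patient42.glucose", 6.1, loaded_at=jun)

march = datetime(2020, 3, 1, tzinfo=timezone.utc)
value_in_march = store.as_of("patient42.glucose", march)
full_history = store.history("patient42.glucose")
```

Because the second reading is appended rather than written over the first, a query for the value as of March still returns the January reading.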

The data warehouse is also distinguished from application-specific databases in the way the data

destined for incorporation in a data warehouse are selected, prepared, and loaded, and how the

underlying database is optimized for use. Once data to be included in a data warehouse have been

identified, the data are cleaned and merged, and the original database structures are manipulated to

mirror those of the data warehouse. For example, data redundancy may be intentionally built into

the data warehouse architecture, thereby minimizing the processing required for a typical query,

which in turn maximizes the efficiency of the underlying database engine.
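The trade-off behind intentional redundancy can be sketched with Python's built-in `sqlite3` module. The schema (`patients`, `assays`, `assays_wide`) and the rows are hypothetical. The wide table repeats each patient's site on every assay row, so the typical query needs no join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Normalized, application-style tables (hypothetical schema).
    CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, site TEXT);
    CREATE TABLE assays (
        assay_id INTEGER PRIMARY KEY, patient_id INTEGER, gene TEXT, level REAL
    );
    INSERT INTO patients VALUES (1, 'Boston'), (2, 'Chicago');
    INSERT INTO assays VALUES
        (10, 1, 'TP53', 0.8), (11, 2, 'TP53', 0.3), (12, 1, 'EGFR', 1.1);

    -- Warehouse-style table: site is stored redundantly on every assay row.
    CREATE TABLE assays_wide AS
    SELECT a.assay_id, a.patient_id, p.site, a.gene, a.level
    FROM assays a JOIN patients p ON p.patient_id = a.patient_id;
    """
)

# Against the normalized tables, the query has to join at read time.
normalized = conn.execute(
    "SELECT p.site, AVG(a.level) FROM assays a "
    "JOIN patients p ON p.patient_id = a.patient_id "
    "GROUP BY p.site ORDER BY p.site"
).fetchall()

# Against the redundant table, the same question is a single-table scan.
denormalized = conn.execute(
    "SELECT site, AVG(level) FROM assays_wide GROUP BY site ORDER BY site"
).fetchall()
```

Both queries return the same per-site averages. The join cost is paid once at load time instead of on every query, at the price of extra storage and of keeping the duplicated column consistent.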

It's important to note that when the specialized vocabulary is peeled away, data repositories, data

marts, and data warehouses are simply databases. The three architectures share the usual issues of

database design, provision for maintenance, security, and periodic modification. Similarly, data

repositories, data marts, and data warehouses are built with some form of a database management

system, a program that allows researchers to store, process, and manage data in a systematic way.

One use of a fully functional data warehouse or data mart is to support data mining, the

process of extracting meaningful relationships from usually very large quantities of seemingly

unrelated data. Specialized data-mining tools allow researchers to perform complex analyses and

predictions on data. A prerequisite to data mining and the archiving process in general is the

availability of a controlled vocabulary that provides a single term for a given concept. This controlled

vocabulary is most often implemented as part of a data dictionary—a program that maps or

translates identical concepts that are expressed in different words, phrases, or units into a single

vocabulary. A popular controlled vocabulary is the Medical Subject Headings (MeSH), maintained by

the U.S. National Library of Medicine, and used with the government-sponsored PubMed biomedical

literature database.
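The mapping role of a data dictionary can be sketched as a pair of lookup tables in Python. The synonym table below is a small hand-written illustration, not content drawn from MeSH, and the function names `normalize_term` and `normalize_weight` are hypothetical.

```python
# Hypothetical synonym table: many surface forms, one preferred term each.
SYNONYMS = {
    "heart attack": "Myocardial Infarction",
    "mi": "Myocardial Infarction",
    "myocardial infarction": "Myocardial Infarction",
    "high blood pressure": "Hypertension",
    "hypertension": "Hypertension",
}

# Conversion factors to a single standard unit (kilograms).
TO_KILOGRAMS = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}


def normalize_term(raw):
    """Map a free-text term to its preferred form, or None if unknown."""
    key = " ".join(raw.lower().split())  # fold case and collapse whitespace
    return SYNONYMS.get(key)


def normalize_weight(value, unit):
    """Convert a weight expressed in a known unit to kilograms."""
    return value * TO_KILOGRAMS[unit.lower()]


terms = [normalize_term(t) for t in ("Heart  Attack", "MI", "hypertension")]
weight_kg = normalize_weight(2000, "g")
```

Returning `None` for an unrecognized term, rather than guessing, is one way to flag records that need manual review before they are loaded.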

Related to the concept of databases is the data archive, a nonvolatile holder for infrequently

accessed data that is optimized for data recovery and data longevity. Strictly speaking, an

archive needn't be a database. Archives are commonly made on multi-gigabyte tape cartridges that

are stored offsite in environmentally controlled conditions to minimize the chances of data loss.

Armed with these core definitions, the reader can proceed with this chapter, which considers

databases from a functional, data-management perspective before exploring the core technologies.
