Biomedical Engineering Reference
In-Depth Information
organized according to the user's likely needs. Like a data repository, a data mart has a narrow focus
on data that is specific to a particular research project or task. That is, a data mart contains a subset
of the data contained in other databases as opposed to an indiscriminate mass copying of all the data
from another database. The major difference between a data mart and a data repository is that a
data mart contains data extracted or mirrored—copied in real time—from multiple application
databases.
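The extraction step described above can be sketched in a few lines. This is a minimal illustration, not a production design: the table names (`results`, `observations`), the column layout, the gene symbols, and the helper functions `make_source` and `build_data_mart` are all hypothetical, and in-memory SQLite databases stand in for the application databases.

```python
import sqlite3


def make_source(rows):
    """Create a stand-in application database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE results (gene TEXT, value REAL)")
    conn.executemany("INSERT INTO results VALUES (?, ?)", rows)
    conn.commit()
    return conn


def build_data_mart(sources, project_gene):
    """Copy only the rows relevant to one project from every source."""
    mart = sqlite3.connect(":memory:")
    mart.execute(
        "CREATE TABLE observations (source TEXT, gene TEXT, value REAL)"
    )
    for name, conn in sources.items():
        # Select a subset; the rest of the source database is left behind.
        rows = conn.execute(
            "SELECT gene, value FROM results WHERE gene = ?", (project_gene,)
        ).fetchall()
        mart.executemany(
            "INSERT INTO observations VALUES (?, ?, ?)",
            [(name, gene, value) for gene, value in rows],
        )
    mart.commit()
    return mart


sources = {
    "sequencing": make_source([("BRCA1", 0.82), ("TP53", 0.40)]),
    "clinical": make_source([("BRCA1", 1.10), ("EGFR", 0.05)]),
}
mart = build_data_mart(sources, "BRCA1")
mart_rows = mart.execute(
    "SELECT source, gene, value FROM observations ORDER BY source"
).fetchall()
print(mart_rows)
```

The mart ends up holding only the rows that match the project's focus, each tagged with the source it came from, rather than a wholesale copy of either database.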
One step up from the data mart is the data warehouse, a central database, frequently very large,
that can provide authenticated researchers with access to all of an institution's information. A
data warehouse is usually populated with data from a variety of incompatible sources, such as
sequencing machines, clinical systems, or national genomic databases. Because a data warehouse
combines data from a variety of application-oriented databases into a single system, data from
disparate sources must be cleaned, encoded, and translated so that a standard set of analytical tools
can be used with the data. Furthermore, the data in a data warehouse are nonvolatile in that new
data are appended to the database and never replace existing data. In addition, the data warehouse
is considered time-variant in that the data are time-stamped.
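The nonvolatile and time-variant properties can be illustrated with a small append-only store. This is a sketch under stated assumptions: the class `AppendOnlyStore`, its method names, and the sample key are hypothetical, and a Python list stands in for the warehouse tables.

```python
from datetime import datetime, timezone


class AppendOnlyStore:
    """Toy store: rows are only ever appended, and each carries a timestamp."""

    def __init__(self):
        self._rows = []  # (key, value, loaded_at) tuples; never updated in place

    def append(self, key, value, loaded_at=None):
        if loaded_at is None:
            loaded_at = datetime.now(timezone.utc)
        self._rows.append((key, value, loaded_at))

    def history(self, key):
        """Every value ever recorded for the key, in load order."""
        return [(v, t) for k, v, t in self._rows if k == key]

    def as_of(self, key, when):
        """The most recent value recorded at or before the given time."""
        candidates = [(t, v) for k, v, t in self._rows if k == key and t <= when]
        if not candidates:
            return None
        return max(candidates, key=lambda pair: pair[0])[1]


store = AppendOnlyStore()
key = "patient-17/hemoglobin"  # hypothetical identifier
store.append(key, 13.1, datetime(2020, 1, 1, tzinfo=timezone.utc))
store.append(key, 12.4, datetime(2021, 1, 1, tzinfo=timezone.utc))

mid_2020 = store.as_of(key, datetime(2020, 6, 1, tzinfo=timezone.utc))
latest = store.as_of(key, datetime(2022, 1, 1, tzinfo=timezone.utc))
```

Because the second measurement is appended rather than written over the first, a query can still recover the value that was current at any earlier point in time.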
The data warehouse is also distinguished from application-specific databases in the way the data
destined for incorporation in a data warehouse are selected, prepared, and loaded, and how the
underlying database is optimized for use. Once data to be included in a data warehouse have been
identified, the data are cleaned and merged, and the original database structures are manipulated to
mirror those of the data warehouse. For example, data redundancy may be intentionally built into
the data warehouse architecture, thereby minimizing the processing required for a typical query,
which in turn maximizes the efficiency of the underlying database engine.
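As a rough illustration of intentional redundancy, the sketch below compares a normalized layout, where a typical query needs a join, with a warehouse-style table in which the diagnosis is copied onto every sample row. The schema, table names, and values are hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, diagnosis TEXT);
CREATE TABLE samples (
    sample_id INTEGER PRIMARY KEY, patient_id INTEGER, expression REAL
);
INSERT INTO patients VALUES (1, 'melanoma'), (2, 'control');
INSERT INTO samples VALUES (10, 1, 2.5), (11, 1, 3.5), (12, 2, 1.0);

-- Warehouse-style table: diagnosis is stored redundantly on each sample row.
CREATE TABLE sample_facts AS
SELECT s.sample_id, s.patient_id, p.diagnosis, s.expression
FROM samples s JOIN patients p ON p.patient_id = s.patient_id;
""")

# Normalized layout: the query must join two tables.
joined = db.execute("""
    SELECT p.diagnosis, AVG(s.expression)
    FROM samples s JOIN patients p ON p.patient_id = s.patient_id
    GROUP BY p.diagnosis ORDER BY p.diagnosis
""").fetchall()

# Redundant layout: the same answer comes from a single table.
flat = db.execute("""
    SELECT diagnosis, AVG(expression)
    FROM sample_facts
    GROUP BY diagnosis ORDER BY diagnosis
""").fetchall()

print(joined == flat)
```

Both queries return the same averages. The redundant table trades extra storage, and extra care at load time to keep the copies consistent, for a simpler query at read time.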
When the specialized vocabulary is peeled away, data repositories, data
marts, and data warehouses are simply databases. The three architectures share the usual issues of
database design, provision for maintenance, security, and periodic modification. Similarly, data
repositories, data marts, and data warehouses are built with some form of a database management
system, a program that allows researchers to store, process, and manage data in a systematic way.
A fully functional data warehouse or data mart also supports data mining, the
process of extracting meaningful relationships from usually very large quantities of seemingly
unrelated data. Specialized data-mining tools allow researchers to perform complex analyses and
predictions on data. A prerequisite to data mining and the archiving process in general is the
availability of a controlled vocabulary that provides a single term for a given concept. This controlled
vocabulary is most often implemented as part of a data dictionary—a program that maps or
translates identical concepts that are expressed in different words, phrases, or units into a single
vocabulary. A popular controlled vocabulary is the Medical Subject Headings (MeSH), maintained by
the U.S. National Library of Medicine, and used with the government-sponsored PubMed biomedical
literature database.
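The mapping role of a data dictionary can be sketched as follows. The synonym table, the unit table, and the function names are assumptions made for illustration; they are not drawn from MeSH or from any particular data dictionary product.

```python
# Hypothetical synonym table: every variant maps to one preferred term.
SYNONYMS = {
    "heart attack": "myocardial infarction",
    "mi": "myocardial infarction",
    "myocardial infarction": "myocardial infarction",
}

# Hypothetical unit table: conversion factors to a single standard unit (kg).
TO_KILOGRAMS = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}


def normalize_term(term):
    """Translate a free-text term into the controlled vocabulary."""
    key = term.strip().lower()
    if key not in SYNONYMS:
        raise KeyError(f"term not in controlled vocabulary: {term!r}")
    return SYNONYMS[key]


def to_kilograms(value, unit):
    """Express a mass in the single unit used by the vocabulary."""
    return value * TO_KILOGRAMS[unit.strip().lower()]


records = [("Heart Attack", 2000, "g"), ("MI", 10, "lb")]
normalized = [
    (normalize_term(term), to_kilograms(value, unit))
    for term, value, unit in records
]
```

After normalization, both records carry the same concept label and the same unit, so a single analytical query can treat them as comparable. Rejecting unknown terms, rather than passing them through, is one possible design choice for keeping the vocabulary controlled.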
Related to the concept of databases is the data archive, a nonvolatile holder for infrequently
accessed data that is optimized for data recovery and data longevity. Strictly speaking, an
archive needn't be a database. Archives are commonly made on multi-gigabyte tape cartridges that
are stored offsite in environmentally controlled conditions to minimize the chances of data loss.
Armed with these core definitions, the reader can proceed with this chapter, which considers
databases from a functional, data-management perspective before exploring the core technologies.