Biomedical Engineering Reference
The relative timing of the steps in the knowledge-discovery process depends on whether the source
of data is a data warehouse or one or more separate databases. A data warehouse is a central
database in which data have been combined from a variety of incompatible sources, such as
sequencing machines, clinical systems, textual bibliographic databases, or national genomic
databases. In the process of combining data from disparate sources, the data are selected, cleaned,
and transformed to support user-driven analytical and data-driven mining tools.
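The select, clean, and transform steps described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two source tables, their field names, the sample values, and the helper functions select, clean, transform, and load_warehouse are hypothetical and do not come from any particular warehouse product.

```python
# Hypothetical records from two incompatible sources (illustrative data only).
clinical_rows = [
    {"patient": "P001", "gene": " brca1 ", "age": "54"},
    {"patient": "P002", "gene": "TP53", "age": ""},  # age is missing
]
sequencer_rows = [
    {"sample_id": "P001", "symbol": "BRCA1", "reads": 1200},
    {"sample_id": "P003", "symbol": "tp53", "reads": 800},
]


def select(row, fields):
    """Keep only the wanted fields, renaming them to the shared warehouse names."""
    return {new: row[old] for old, new in fields.items()}


def clean(row):
    """Strip stray whitespace and turn empty strings into explicit missing values."""
    cleaned = {}
    for key, value in row.items():
        if isinstance(value, str):
            value = value.strip()
        cleaned[key] = value if value != "" else None
    return cleaned


def transform(row):
    """Convert values to one representation: upper-case gene symbols, integer ages."""
    row = dict(row)
    if row.get("gene") is not None:
        row["gene"] = row["gene"].upper()
    if row.get("age") is not None:
        row["age"] = int(row["age"])
    return row


def load_warehouse():
    """Combine both sources into one list of uniformly shaped records."""
    sources = [
        ("clinical", clinical_rows,
         {"patient": "subject_id", "gene": "gene", "age": "age"}),
        ("sequencer", sequencer_rows,
         {"sample_id": "subject_id", "symbol": "gene", "reads": "reads"}),
    ]
    warehouse = []
    for source_name, rows, fields in sources:
        for row in rows:
            record = transform(clean(select(row, fields)))
            record["source"] = source_name  # keep provenance as metadata
            warehouse.append(record)
    return warehouse


warehouse = load_warehouse()
for record in warehouse:
    print(record)
```

Because the work is done once, when the data are loaded, later mining operations can start from uniformly named and formatted records instead of repeating these steps for each source.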
Whereas a data warehouse is a ready store of data to be mined at any time, using separate
databases requires much more work on an as-needed basis. The processing up to the point of data
mining may take hours or weeks, depending on the complexity and size of the databases involved in
the process.
The advantage of using a data warehouse approach to data mining is time savings. Assuming that
everything needed for data mining is available in the data warehouse, a typical mining operation can
often be completed in a matter of hours, depending on the processing power available, the size
of the data warehouse, and the complexity of the mining operation.
However, this ability to begin mining operations at any time comes at a cost. A data warehouse that
can efficiently support data mining is significantly larger, and its associated data processing takes
much longer, than a simple database designed only to provide a central, unified data repository
accessible through a single user interface. The warehouse is larger and its processing more complex
because data mining requires increasingly fine-grained data, as well as contextual data or metadata
that support the mining process. For example, data mining requires a controlled vocabulary, usually
implemented as part of a data dictionary, so that a single word is used to express a given concept.
Similarly, the extra attention to cleaning and otherwise processing the data is necessary to maximize
the odds that conclusions based on data mining are valid.
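As a rough illustration of the controlled vocabulary idea, the following Python sketch maps free-text synonyms to a single preferred term. The dictionary contents, the sample records, and the names DATA_DICTIONARY, SYNONYM_TO_TERM, and normalize_term are hypothetical; a real data dictionary would be far larger and would typically draw on an established terminology.

```python
# Hypothetical data dictionary: each preferred term lists the synonyms it replaces.
DATA_DICTIONARY = {
    "myocardial infarction": {"heart attack", "mi", "cardiac infarction"},
    "neoplasm": {"tumor", "tumour"},
}

# Invert the dictionary so that any known synonym looks up its preferred term.
SYNONYM_TO_TERM = {}
for preferred, synonyms in DATA_DICTIONARY.items():
    SYNONYM_TO_TERM[preferred] = preferred
    for synonym in synonyms:
        SYNONYM_TO_TERM[synonym] = preferred


def normalize_term(raw):
    """Return the preferred term for raw text, or None if the term is unknown."""
    key = " ".join(raw.lower().split())  # ignore case and extra whitespace
    return SYNONYM_TO_TERM.get(key)


records = ["Heart Attack", "MI", "tumour", "fracture"]
normalized = [normalize_term(r) for r in records]
print(normalized)
```

Returning None for an unknown term, rather than passing the raw text through, is a design choice in this sketch: it makes unmapped values easy to find and review before they reach the mining step.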