What's more, there is no guarantee that the data in the data warehouse will be sufficient to support
the desired data-mining activities. Additional data may be needed from the source databases, and those
data must then also be cleaned, transformed, and stored, which negates the time advantage of the
data warehouse. One approach to guarding against this eventuality is to incorporate more data into
the data warehouse when it is built, at the cost of increased complexity and size, with no guarantee
that any of the additional data in the warehouse will ever be used in mining activities.
The primary advantage of using a database approach to data mining is that resources are used on an
as-needed basis. Only those data from the separate databases that are involved in a specific data-mining
operation are processed. Although it may take days or weeks to arrange for the
appropriate processing in preparation for data mining, the resources required for just-in-time data
mining are generally much less than those associated with data warehousing.
Regardless of the data source, knowledge discovery is an iterative process that involves feedback at
each stage, as illustrated in Figure 7-1. This feedback can be used programmatically or can serve as
the basis for human decision-making. For example, if the preprocessing and cleaning of data from a
data warehouse results in an insufficient quantity of cleaned data, or inappropriate data altogether,
then the researcher may redefine the selection and sampling criteria to include more or different
data.
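The programmatic use of this feedback can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a method given in the text: the function names, the rule that "cleaning" means dropping records with a missing value, and the policy of doubling the sampling fraction are all hypothetical stand-ins.

```python
import random

def select(records, fraction, rng):
    """Draw a random sample containing the given fraction of the records."""
    k = max(1, int(len(records) * fraction))
    return rng.sample(records, k)

def clean(records):
    """Hypothetical cleaning step: drop any record with a missing (None) value."""
    return [r for r in records if None not in r.values()]

def select_until_sufficient(records, min_clean, fraction=0.1, growth=2.0, seed=0):
    """Widen the sampling fraction until enough cleaned records survive.

    Returns the cleaned records and the fraction that produced them. The loop
    stops once the whole data set has been sampled, even if it is still too small.
    """
    rng = random.Random(seed)
    while True:
        cleaned = clean(select(records, fraction, rng))
        if len(cleaned) >= min_clean or fraction >= 1.0:
            return cleaned, fraction
        # Feedback: too little usable data, so redefine the selection criteria.
        fraction = min(1.0, fraction * growth)
```

In practice the feedback might change which data are selected rather than how many, and a researcher might make that decision by hand rather than by a fixed rule.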
Although the methodology seems straightforward, data mining and the overall knowledge-discovery
process involve much more than the simple statistical analysis of data. For example, difficult-to-
describe metrics, such as novelty, interestingness, and understandability, are often used to define
data-mining parameters for data discovery. Similarly, each phase of the knowledge-discovery process
has associated challenges, as outlined here.
Selection and Sampling
Because of practical computational limitations and a priori knowledge, data mining isn't simply about
searching for every possible relationship in a database. In a large database or data warehouse, there
may be hundreds or thousands of valueless relationships. For example, a researcher interested in the
relationship of SNPs with clinical findings can reasonably ignore the zip code of the tissue donors or
the dates that the tissue samples were obtained. There are exceptions, of course, such as when a
specific ethnicity is concentrated in a geographical area defined by a zip code.
Because there may be millions of records involved and thousands of variables, initial data mining is
typically restricted to computationally tenable samples of the holdings of an entire data warehouse.
The evaluation of the relationships that are revealed in these samples can be used to determine
which relationships in the data should be mined further using the complete data warehouse. With
large, complex databases, even with sampling, the computational resource requirements associated
with non-directed data mining may be excessive. In this situation, researchers generally rely on their
knowledge of biology to identify potentially valuable relationships and they limit sampling based on
these heuristics.
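The combination of heuristic variable selection and sampling described above might look like the following Python sketch. It is illustrative only: the field names are hypothetical, and Pearson correlation is used as a simple stand-in for whatever measure of association a real data-mining tool would apply.

```python
import random
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable carries no association
    return sxy / (sxx * syy) ** 0.5

def screen_sample(records, variables, sample_size, threshold=0.5, seed=0):
    """Screen a random sample for variable pairs worth mining in full.

    `variables` is the researcher's heuristic short list; fields left out of it
    (for example a donor zip code) are never examined.
    """
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    hits = []
    for a, b in combinations(variables, 2):
        r = pearson([rec[a] for rec in sample], [rec[b] for rec in sample])
        if abs(r) >= threshold:
            hits.append((a, b, r))
    return hits
```

Only the pairs returned by the screen would then be examined against the complete data warehouse, which keeps the expensive pass limited to relationships that already look promising in the sample.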
Preprocessing and Cleaning
The bulk of work associated with knowledge discovery is preparing the data for the actual analysis
associated with data mining. The major preparatory activities, listed in Table 7-1, are normally
performed to some extent in the creation of a data warehouse. However, data mining may be
performed on one or more independent databases, or the data in the warehouse may not have been
cleaned initially, at least to the degree necessary for optimum data-mining results. In either case,
these activities need to be performed as part of the preprocessing and cleaning phase of the overall
knowledge-discovery process.
Table 7-1. Data Mining Preparatory Activities.