Java Reference
In-Depth Information
(Data Mining)
Figure 1-3. A workflow involving data mining.
There is a comprehensive, recognized process for data mining,
CRISP-DM, which we cover in detail in Chapter 3. For now, we can
consider a simplified process that begins with defining the problem
and its objectives, identifying data for mining, and assessing data
quality. The availability of data for mining is not the same thing as the
appropriateness of that data for mining. If the data is dirty (i.e., contains
errors and inconsistencies), it likely must first be cleaned. Note that the
adage “garbage in, garbage out” applies with particular force to data mining.
This data is then transformed as required by the data mining tool
and/or according to the creativity of the data miner. Transformations
include, for example, replacing misspelled values with correct ones,
identifying outlier values, and creating attributes derived from other
attributes. The knowledge extraction process continues with mining
the transformed data to produce a data mining model, which is then
evaluated for quality and relevance to the problem's objectives. The
knowledge extraction step could involve the labor of dozens of
statisticians in a back room crunching numbers, or a data mining
algorithm iterating over the data to produce a model of the data.
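The three transformations named above (correcting misspelled values, identifying outliers, and deriving attributes) can be sketched in plain Java. This is only an illustration under stated assumptions: the class name TransformSketch, the correction table, the standard-deviation rule for outliers, and the debt-to-income attribute are all hypothetical and do not come from any particular data mining tool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

final class TransformSketch {

    // Hypothetical table of known misspellings; a real project would maintain its own.
    static final Map<String, String> CORRECTIONS = Map.of(
            "Bostn", "Boston",
            "New Yrok", "New York");

    /** Replaces a known misspelling with its correction; other values pass through unchanged. */
    static String correct(String value) {
        return CORRECTIONS.getOrDefault(value, value);
    }

    /**
     * Returns the indices of values lying more than k population standard
     * deviations from the mean. This is one simple outlier rule among many.
     */
    static List<Integer> outlierIndices(double[] values, double k) {
        double mean = Arrays.stream(values).average().orElse(0.0);
        double variance = Arrays.stream(values)
                .map(v -> (v - mean) * (v - mean))
                .average()
                .orElse(0.0);
        double stdDev = Math.sqrt(variance);

        List<Integer> flagged = new ArrayList<>();
        for (int i = 0; i < values.length; i++) {
            if (stdDev > 0 && Math.abs(values[i] - mean) > k * stdDev) {
                flagged.add(i);
            }
        }
        return flagged;
    }

    /** Derives a new attribute from two existing ones (NaN when income is zero). */
    static double debtToIncome(double debt, double income) {
        return income == 0 ? Double.NaN : debt / income;
    }

    public static void main(String[] args) {
        System.out.println(correct("Bostn"));
        System.out.println(outlierIndices(new double[] {10, 12, 11, 13, 12, 95}, 2.0));
        System.out.println(debtToIncome(5000, 50000));
    }
}
```

Note that the sketch only flags outliers; whether to remove, cap, or keep them is a decision left to the data miner.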
The model itself may be used directly to understand, for example,
customer segments, or which factors most influence customers to
accept an offer. The model may also be used to generate
scores (i.e., make predictions or classifications). Scoring can be
performed in batch (i.e., all at once over a given dataset such as a large
customer dataset), or integrated into applications for real-time
scoring such as in call center applications or online retail product
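The difference between batch and real-time scoring can be illustrated with a short Java sketch. Everything here is an assumption made for illustration: the Customer fields, the ScoringSketch class, and the logistic formula with made-up coefficients stand in for a model that a real mining run would produce.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ScoringSketch {

    /** Hypothetical customer record; the attributes are illustrative only. */
    static final class Customer {
        final String id;
        final int age;
        final double balance;

        Customer(String id, int age, double balance) {
            this.id = id;
            this.age = age;
            this.balance = balance;
        }
    }

    // Made-up coefficients standing in for a mined model.
    static final double INTERCEPT = -2.0;
    static final double AGE_WEIGHT = 0.05;
    static final double BALANCE_WEIGHT = 0.0001;

    /** Real-time style: score a single record on demand, e.g. during a call. */
    static double scoreOne(Customer c) {
        double z = INTERCEPT + AGE_WEIGHT * c.age + BALANCE_WEIGHT * c.balance;
        return 1.0 / (1.0 + Math.exp(-z)); // logistic function, result in (0, 1)
    }

    /** Batch style: score every record in a dataset in one pass. */
    static Map<String, Double> scoreBatch(List<Customer> customers) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (Customer c : customers) {
            scores.put(c.id, scoreOne(c));
        }
        return scores;
    }

    public static void main(String[] args) {
        List<Customer> dataset = List.of(
                new Customer("c1", 40, 0.0),
                new Customer("c2", 25, 12000.0));
        System.out.println(scoreBatch(dataset));       // batch scoring
        System.out.println(scoreOne(dataset.get(0)));  // single-record scoring
    }
}
```

Both paths call the same scoring function, so a record receives the same score whether it is processed in a batch or on demand; only the way the model is invoked differs.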
Solving business and scientific problems often requires many
components in a complex process flow: for example, customer
interaction, data collection and staging, data analysis and summarization,
report generation and distribution, decision making, and deployment.
As such, data mining does not exist by itself but is often
integrated with a business process to provide value.
Operational systems collect data, typically in relational databases,
that is then cleaned and staged into the corporate data warehouse.