the nontrivial process of identifying valid, novel, potentially useful, and
ultimately understandable patterns in data. [10]
In this context, “pattern” is meant in a very general sense. A pattern is whatever
a data mining algorithm may find in or generate from the data, e.g. a model
that scores customers based on a decision tree or a neural network,
a clustering of the data, or a set of association rules. Whereas the demand for
validity, novelty, usefulness and understandability of these patterns is immediately
clear, the implications of the term “nontrivial process” might not be obvious at
first glance and are worth a closer look.
3.1 The Phases of the KDD Process
A KDD process consists of several tasks. The actual mining, that is
to say the application of a data mining algorithm to a dataset, is only one of
these steps. Following the CRISP-DM model [9,31], we distinguish the
tasks listed below:
1. Business Understanding
The very first step of a KDD project should be a close look at the problem from
the business point of view. The goal of this phase is to gain a deeper
understanding of the project objectives and the surrounding circumstances strictly
from the business perspective. Finally, the insights from this initial phase are
turned into a data mining problem definition.
2. Data Understanding
Based on the results from the business point of view, the second step is to
get familiar with the available data. The goal is to understand the attributes
and the corresponding attribute values and to uncover semantics that may be
hidden in the data. Furthermore, at this stage one should figure out what
exactly the available data offers, that is to say, whether it has the potential
to answer our mining questions or not, and, if possible, select promising
subsets of the data.
3. Data Preparation
The next step is to construct the dataset on which the mining algorithm is to be
run. This phase covers both syntactic aspects - format transformations
for the employed mining algorithm - and semantic aspects like table, record
and attribute selection. Last but not least, this phase also includes deriving
new attributes that make explicit higher-level information only implicitly contained
in the raw data (e.g. deriving “day of the week” from “date”).
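The attribute derivation mentioned in this phase can be illustrated with a minimal Python sketch. It assumes that the raw records are dictionaries holding an ISO-formatted “date” string; the function name add_day_of_week, the field names and the sample records are hypothetical and chosen only for illustration.

```python
from datetime import date

# date.weekday() returns 0 for Monday through 6 for Sunday.
WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]


def add_day_of_week(records, date_field="date", new_field="day_of_week"):
    """Return copies of the records, each extended by a derived weekday attribute."""
    enriched = []
    for record in records:
        # Assumption: the date is stored as an ISO string such as "2024-01-06".
        parsed = date.fromisoformat(record[date_field])
        enriched.append({**record, new_field: WEEKDAY_NAMES[parsed.weekday()]})
    return enriched


# Hypothetical raw records, e.g. one purchase per customer.
raw = [
    {"customer_id": 1, "date": "2024-01-01"},
    {"customer_id": 2, "date": "2024-01-06"},
]

prepared = add_day_of_week(raw)
for row in prepared:
    print(row["customer_id"], row["date"], row["day_of_week"])
```

The raw records are left untouched and the derived attribute is added to copies, so the preparation step can be repeated or adjusted without reloading the data.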
4. Modeling (or Mining)
In the modeling phase the actual data mining takes place. Based on the
identified business goals and the assessment of the available data an appropriate
mining algorithm is chosen and run on the prepared data.
5. Evaluation
Evaluating the results of the mining run mainly covers three aspects. First of
all, it is necessary to check whether everything went right from the technical
point of view. Was the mining algorithm able to read and interpret
the prepared dataset correctly? Was all designated information actually