1. Understanding the problem domain
In this step one works closely with domain experts to define the problem,
determine the project goals, identify key people, and learn about current
solutions to the problem. It also involves learning domain-specific
terminology. A description of the problem, including its restrictions, is
prepared. The project goals must then be translated into DMKD goals, and
this step may include an initial selection of potential DM tools.
2. Understanding the data
This step includes collecting sample data and deciding which data will be
needed, including its format and size. If background knowledge exists, some
attributes may be ranked as more important than others. Next, the usefulness
of the data with respect to the DMKD goals must be verified. The data need to
be checked for completeness, redundancy, missing values, plausibility of
attribute values, and the like.
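As a minimal sketch of the checks just listed, assuming records are held as simple attribute-to-value mappings (the attribute names, values, and plausibility range below are invented for illustration):

```python
# Sketch of Step 2 data checks on hypothetical records
# (dicts of attribute -> value, with None marking a missing value).

def completeness(records, attributes):
    """Fraction of non-missing values per attribute."""
    n = len(records)
    return {a: sum(r.get(a) is not None for r in records) / n
            for a in attributes}

def find_duplicates(records):
    """Indices of records that duplicate an earlier one (redundancy check)."""
    seen, dupes = set(), []
    for i, r in enumerate(records):
        key = tuple(sorted(r.items()))
        if key in seen:
            dupes.append(i)
        else:
            seen.add(key)
    return dupes

def implausible(records, attribute, lo, hi):
    """Indices whose attribute value falls outside a plausible [lo, hi] range."""
    return [i for i, r in enumerate(records)
            if r.get(attribute) is not None and not (lo <= r[attribute] <= hi)]

data = [{"age": 34, "income": 52000},
        {"age": None, "income": 48000},
        {"age": 34, "income": 52000},
        {"age": 212, "income": 61000}]
```

On this toy data, `completeness` reports that `age` is only 75% complete, `find_duplicates` flags the third record as a repeat of the first, and `implausible(data, "age", 0, 120)` flags the out-of-range age.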
3. Preparation of the data
This is the key step on which the success of the entire knowledge discovery
process depends; it usually consumes about half of the entire project effort. In
this step, we decide which data will be used as input to the data mining tools
in Step 4. It may involve sampling the data, running correlation and
significance tests, and cleaning the data, e.g., checking the completeness of
data records and correcting for noise. The cleaned data can be further
processed by feature selection and extraction algorithms (to reduce
dimensionality), by derivation of new attributes (say, by means of
discretization), and by summarization of data (data granularization). The
result is a set of new data records that meet the specific input requirements
of the DM tools planned for use.
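One of the transformations mentioned above, derivation of a new attribute by discretization, can be sketched as follows (equal-width binning is only one of several discretization schemes; the attribute and values are invented for illustration):

```python
# Minimal sketch of a Step 3 transformation: deriving a new, coarser
# attribute by equal-width discretization of a numeric one.

def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index 0..n_bins-1 of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant attribute
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

incomes = [18000, 25000, 40000, 52000, 90000]
income_bins = equal_width_bins(incomes, 3)  # e.g., low / medium / high
```

The derived `income_bins` attribute replaces exact incomes with three coarse categories, which is the kind of granularization a symbolic DM tool may require as input.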
4. Data mining
This is another key step in the knowledge discovery process. Although it is
the data mining tools that discover new information, their application usually
takes less time than data preparation. This step involves the use of the
planned data mining tools, and the selection of new ones if needed. Data
mining tools include many types of algorithms, such as rough and fuzzy sets,
Bayesian methods, evolutionary computing, machine learning, neural
networks, clustering, and preprocessing techniques. Detailed descriptions of
these algorithms and their applications can be found in [17]; descriptions of
data summarization and generalization algorithms can be found in [22]. This
step involves applying several DM tools to the data prepared in Step 3. First,
however, the training and testing procedures need to be designed; a data
model is then constructed using one of the chosen DM tools, and the
generated model is verified using the testing procedures.
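The train/verify procedure described above can be sketched as follows, with a trivial majority-class predictor standing in for whichever DM tool is actually chosen (all names and data here are illustrative):

```python
# Sketch of Step 4's procedure: hold out part of the prepared data,
# build a data model on the rest, then verify it on the held-out part.
# A majority-class predictor stands in for a real DM tool.
import random
from collections import Counter

def train_test_split(records, test_fraction=0.3, seed=0):
    """Design of the testing procedure: a simple random hold-out split."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def fit_majority(train):
    """'Construct a model' by memorizing the most frequent label."""
    return Counter(label for _, label in train).most_common(1)[0][0]

def accuracy(model_label, test):
    """Verify the generated model on data it has not seen."""
    return sum(label == model_label for _, label in test) / len(test)

data = [(x, "yes" if x > 5 else "no") for x in range(10)]
train, test = train_test_split(data)
model = fit_majority(train)
acc = accuracy(model, test)
```

The point of the hold-out split is that `accuracy` is computed only on records the model never saw during construction, which is what makes the verification step meaningful.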
One of the major difficulties in this step is that many commonly used tools
may not scale up to huge volumes of data. Scalable DM tools are
characterized by a linear increase in run time with the number of data points,
within a fixed amount of available memory. Most DM tools are not scalable,
but there are examples of tools that scale well with the size of the input data:
clustering [11], [32], [78]; machine learning [34], [63]; association rules [2],
[3], [70]. An overview of scalable DM tools