1. Understanding the problem domain
In this step one works closely with domain experts to define the problem,
determine the project goals, identify key people, and learn about current
solutions to the problem. It also involves learning domain-specific
terminology. A description of the problem, including its restrictions, is
prepared. The project goals must then be translated into DMKD goals, and
this step may include an initial selection of potential DM tools.
2. Understanding the data
This step includes collecting sample data and deciding which data will be
needed, including its format and size. If background knowledge exists, some
attributes may be ranked as more important than others. Next, the usefulness
of the data with respect to the DMKD goals must be verified. The data need to
be checked for completeness, redundancy, missing values, plausibility of
attribute values, and the like.
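As a minimal sketch of the checks just listed, assuming records are held as simple attribute-to-value mappings (the attribute names, values, and plausibility range below are invented for illustration):

```python
# Sketch of Step 2 data checks on hypothetical records
# (dicts of attribute -> value, with None marking a missing value).

def completeness(records, attributes):
    """Fraction of non-missing values per attribute."""
    n = len(records)
    return {a: sum(r.get(a) is not None for r in records) / n
            for a in attributes}

def find_duplicates(records):
    """Indices of records that duplicate an earlier one (redundancy check)."""
    seen, dupes = set(), []
    for i, r in enumerate(records):
        key = tuple(sorted(r.items()))
        if key in seen:
            dupes.append(i)
        else:
            seen.add(key)
    return dupes

def implausible(records, attribute, lo, hi):
    """Indices whose attribute value falls outside a plausible [lo, hi] range."""
    return [i for i, r in enumerate(records)
            if r.get(attribute) is not None and not (lo <= r[attribute] <= hi)]

data = [{"age": 34, "income": 52000},
        {"age": None, "income": 48000},
        {"age": 34, "income": 52000},
        {"age": 212, "income": 61000}]
```

On this toy data, `completeness` reports that `age` is only 75% complete, `find_duplicates` flags the third record as a repeat of the first, and `implausible(data, "age", 0, 120)` flags the out-of-range age.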
3. Preparation of the data
This is the key step on which the success of the entire knowledge discovery
process depends; it usually consumes about half of the entire project effort. In
this step, we decide which data will be used as input to the data mining tools
in Step 4. It may involve sampling the data, running correlation and
significance tests, and cleaning the data, e.g., checking the completeness of
data records and correcting for noise. The cleaned data can be further
processed by feature selection and extraction algorithms (to reduce
dimensionality), by derivation of new attributes (say, by means of
discretization), and by summarization of data (data granularization). The
result is a set of new data records that meet the specific input requirements
of the DM tools planned for use.
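One of the transformations mentioned above, derivation of a new attribute by discretization, can be sketched as follows (equal-width binning is only one of several discretization schemes; the attribute and values are invented for illustration):

```python
# Minimal sketch of a Step 3 transformation: deriving a new, coarser
# attribute by equal-width discretization of a numeric one.

def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index 0..n_bins-1 of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant attribute
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

incomes = [18000, 25000, 40000, 52000, 90000]
income_bins = equal_width_bins(incomes, 3)  # e.g., low / medium / high
```

The derived `income_bins` attribute replaces exact incomes with three coarse categories, which is the kind of granularization a symbolic DM tool may require as input.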
4. Data mining
This is another key step in the knowledge discovery process. Although it is
the data mining tools that discover new information, their application usually
takes less time than data preparation. This step involves the use of the
planned data mining tools, and the selection of new ones if needed. Data
mining tools include many types of algorithms, such as rough and fuzzy sets,
Bayesian methods, evolutionary computing, machine learning, neural
networks, clustering, and preprocessing techniques. Detailed descriptions of
these algorithms and their applications can be found in [17]; descriptions of
data summarization and generalization algorithms can be found in [22]. This
step involves applying several DM tools to the data prepared in Step 3. First,
however, the training and testing procedures need to be designed; a data
model is then constructed using one of the chosen DM tools, and the
generated model is verified using the testing procedures.
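The train/verify procedure described above can be sketched as follows, with a trivial majority-class predictor standing in for whichever DM tool is actually chosen (all names and data here are illustrative):

```python
# Sketch of Step 4's procedure: hold out part of the prepared data,
# build a data model on the rest, then verify it on the held-out part.
# A majority-class predictor stands in for a real DM tool.
import random
from collections import Counter

def train_test_split(records, test_fraction=0.3, seed=0):
    """Design of the testing procedure: a simple random hold-out split."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def fit_majority(train):
    """'Construct a model' by memorizing the most frequent label."""
    return Counter(label for _, label in train).most_common(1)[0][0]

def accuracy(model_label, test):
    """Verify the generated model on data it has not seen."""
    return sum(label == model_label for _, label in test) / len(test)

data = [(x, "yes" if x > 5 else "no") for x in range(10)]
train, test = train_test_split(data)
model = fit_majority(train)
acc = accuracy(model, test)
```

The point of the hold-out split is that `accuracy` is computed only on records the model never saw during construction, which is what makes the verification step meaningful.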
One of the major difficulties in this step is that many commonly used tools
may not scale up to huge volumes of data. Scalable DM tools are
characterized by a linear increase in run time with the number of data points,
within a fixed amount of available memory. Most DM tools are not scalable,
but there are examples of tools that scale well with the size of the input data:
clustering [11], [32], [78]; machine learning [34], [63]; association rules [2],
[3], [70]. An overview of scalable DM tools