knowledge into data mining problem definition and a preliminary plan
to achieve the objectives.
Data Understanding starts with an initial data collection and proceeds with activities aimed at becoming familiar with the data, identifying data quality problems, discovering first insights into the data, or detecting interesting subsets from which to form hypotheses about hidden information.
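By way of illustration, a minimal sketch of such an initial exploration using pandas (the file name and the annual_spend attribute are hypothetical):

import pandas as pd

# Initial data collection: load the raw data (path is hypothetical)
df = pd.read_csv("customers.csv")

# Become familiar with the data
print(df.shape)        # number of records and attributes
print(df.describe())   # basic statistics for the numeric attributes

# Identify data quality problems
print(df.isna().sum())         # missing values per attribute
print(df.duplicated().sum())   # duplicated records

# Detect an interesting subset, e.g. unusually high-spending customers
high_spenders = df[df["annual_spend"] > df["annual_spend"].quantile(0.95)]
print(high_spenders.head())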
Data Preparation constructs the final dataset from the initial raw data.
Data preparation tasks are likely to be performed multiple times and not
in any prescribed order. Tasks include table, record and attribute selection
as well as transformation and cleaning of data for modelling tools.
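A minimal sketch of such preparation steps with pandas (the table names, attributes and filter condition are hypothetical):

import pandas as pd

customers = pd.read_csv("customers.csv")
orders = pd.read_csv("orders.csv")

# Table and record selection: keep only active customers
data = customers[customers["status"] == "active"]

# Attribute selection: keep attributes relevant to the modelling goal
data = data[["customer_id", "age", "region", "annual_spend"]]

# Transformation: join with aggregated order information
order_counts = orders.groupby("customer_id").size().rename("n_orders").reset_index()
data = data.merge(order_counts, on="customer_id", how="left")

# Cleaning: fill missing values and encode categoricals for the modelling tools
data["n_orders"] = data["n_orders"].fillna(0)
data = pd.get_dummies(data, columns=["region"])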
Modelling techniques are selected and applied, and their parameters are calibrated to optimal values. Several techniques are typically available for the same data mining problem type, each with different data requirements.
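As an illustration, a minimal sketch of parameter calibration with scikit-learn on synthetic data (the choice of a decision tree and the parameter grid are assumptions made only for the example):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared dataset of the previous phase
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Select a technique and calibrate its parameters by cross-validated grid search
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, 10], "min_samples_leaf": [1, 10, 50]},
    cv=5,
)
search.fit(X_train, y_train)
model = search.best_estimator_
print(search.best_params_)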
Evaluation assesses the model and reviews the steps executed to construct it, in order to be certain it properly achieves the business objectives.
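A minimal, self-contained sketch of this step (the 90% recall threshold stands in for a hypothetical business objective; the data and model mirror the previous sketch):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared dataset and the calibrated model
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
model = DecisionTreeClassifier(max_depth=5, random_state=0)

# Estimate out-of-sample performance before deciding on deployment
scores = cross_val_score(model, X, y, cv=5, scoring="recall")
print("mean recall:", scores.mean())

# Review against the (hypothetical) business objective
if scores.mean() < 0.90:
    print("Objective not met: revisit data preparation or modelling choices.")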
Deployment presents the knowledge in a way that the customer can use. It often involves applying models within an organization's decision-making processes.
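One common form this takes is serialising the calibrated model and applying it to new cases inside an operational process; a minimal sketch with joblib (the file name and data are illustrative):

import joblib
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Persist the (here synthetic) calibrated model at the end of the project
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)
joblib.dump(model, "model.joblib")

# Later, within the organization's decision-making process, the model is
# loaded and applied to new cases, e.g. to flag customers for follow-up
deployed = joblib.load("model.joblib")
print(deployed.predict(X[:5]))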
In 1999, the SAS Institute proposed the SEMMA [27] methodology, comprising five phases: Sample, Explore, Modify, Model and Assess. The data mining process starts by taking a representative sample of the target population, with which a confidence level is associated. This sample is then explored and analyzed using visualization and statistical tools in order to obtain a set of significant variables that will become the input for a selected model. The selected model is then analyzed; the goal of this step is to determine relationships among the variables. In this phase, both statistical methods (e.g. discriminant analysis, clustering, and regression analysis) and data-oriented methods (e.g. neural networks, decision trees, association rules) can be used. The final phase in this process consists of evaluating the model and comparing it with different statistical methods and samples. On the other hand, Clementine proposes CATs [29] (Clementine Application Templates), application-specific libraries that follow the CRISP-DM standard, with each CAT stream assigned to a CRISP-DM phase.
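A minimal end-to-end sketch of the five SEMMA phases on synthetic data (the variables, the correlation-based variable selection and the decision tree are illustrative assumptions, not part of the SAS tooling):

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Sample: take a representative sample of the (here synthetic) target population
X, y = make_classification(n_samples=10000, n_features=8, random_state=0)
data = pd.DataFrame(X, columns=[f"x{i}" for i in range(8)])
data["target"] = y
sample = data.sample(frac=0.2, random_state=0)

# Explore: use statistics and visualization to find significant variables
print(sample.describe())
significant = sample.corr()["target"].abs().sort_values(ascending=False)[1:4].index

# Modify: keep the selected variables as the input for the chosen model
X_in, y_in = sample[significant], sample["target"]

# Model: apply a data-oriented method (a decision tree, as one example)
model = DecisionTreeClassifier(max_depth=5, random_state=0)

# Assess: evaluate the model, e.g. by cross-validated accuracy
print(cross_val_score(model, X_in, y_in, cv=5).mean())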
All of the above models depend heavily on the knowledge of the analysts (business and domain experts, data miners). There seems to be a need for an intermediate level of conceptualization that can provide an interface between the experts and the clients.
According to Grossman et al. [12] “although efforts have been done to ho-
mogenize terminology and concepts among standards more work is required”.
A framework for developing a unified model for data mining is proposed in [10]. The goal of the model is to provide a uniform data structure for all data mining patterns, together with operators to manipulate them. The model is designed under a three-view architecture (process view, model view and data view) that includes a process model and data views. The model view contains a set