knowledge into data mining problem definition and a preliminary plan
to achieve the objectives.
Data Understanding starts with an initial data collection and proceeds with activities aimed at becoming familiar with the data, identifying data quality problems, discovering first insights into the data, or detecting interesting subsets from which to form hypotheses about hidden information.
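By way of illustration, a minimal sketch of such an initial exploration using pandas (the file name and the annual_spend attribute are hypothetical):

import pandas as pd

# Initial data collection: load the raw data (path is hypothetical)
df = pd.read_csv("customers.csv")

# Become familiar with the data
print(df.shape)        # number of records and attributes
print(df.describe())   # basic statistics for the numeric attributes

# Identify data quality problems
print(df.isna().sum())         # missing values per attribute
print(df.duplicated().sum())   # duplicated records

# Detect an interesting subset, e.g. unusually high-spending customers
high_spenders = df[df["annual_spend"] > df["annual_spend"].quantile(0.95)]
print(high_spenders.head())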
Data Preparation constructs the final dataset from the initial raw data.
Data preparation tasks are likely to be performed multiple times and not
in any prescribed order. Tasks include table, record and attribute selection
as well as transformation and cleaning of data for modelling tools.
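A minimal sketch of such preparation steps with pandas (the table names, attributes and filter condition are hypothetical):

import pandas as pd

customers = pd.read_csv("customers.csv")
orders = pd.read_csv("orders.csv")

# Table and record selection: keep only active customers
data = customers[customers["status"] == "active"]

# Attribute selection: keep attributes relevant to the modelling goal
data = data[["customer_id", "age", "region", "annual_spend"]]

# Transformation: join with aggregated order information
order_counts = orders.groupby("customer_id").size().rename("n_orders").reset_index()
data = data.merge(order_counts, on="customer_id", how="left")

# Cleaning: fill missing values and encode categoricals for the modelling tools
data["n_orders"] = data["n_orders"].fillna(0)
data = pd.get_dummies(data, columns=["region"])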
Modelling techniques are selected and applied, and their parameters are calibrated to optimal values. Several techniques are typically available for the same data mining problem type, each with different data requirements.
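As an illustration, a minimal sketch of parameter calibration with scikit-learn on synthetic data (the choice of a decision tree and the parameter grid are assumptions made only for the example):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared dataset of the previous phase
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Select a technique and calibrate its parameters by cross-validated grid search
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, 10], "min_samples_leaf": [1, 10, 50]},
    cv=5,
)
search.fit(X_train, y_train)
model = search.best_estimator_
print(search.best_params_)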
Evaluation assesses the model and reviews the steps executed to construct it, in order to be certain it properly achieves the business objectives.
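A minimal, self-contained sketch of this step (the 90% recall threshold stands in for a hypothetical business objective; the data and model mirror the previous sketch):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared dataset and the calibrated model
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
model = DecisionTreeClassifier(max_depth=5, random_state=0)

# Estimate out-of-sample performance before deciding on deployment
scores = cross_val_score(model, X, y, cv=5, scoring="recall")
print("mean recall:", scores.mean())

# Review against the (hypothetical) business objective
if scores.mean() < 0.90:
    print("Objective not met: revisit data preparation or modelling choices.")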
Deployment presents the knowledge in a way that the customer can use. It often involves applying models within an organization's decision-making processes.
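One common form this takes is serialising the calibrated model and applying it to new cases inside an operational process; a minimal sketch with joblib (the file name and data are illustrative):

import joblib
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Persist the (here synthetic) calibrated model at the end of the project
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)
joblib.dump(model, "model.joblib")

# Later, within the organization's decision-making process, the model is
# loaded and applied to new cases, e.g. to flag customers for follow-up
deployed = joblib.load("model.joblib")
print(deployed.predict(X[:5]))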
In 1999, the SAS Institute proposed the SEMMA [27] methodology, comprising five phases: Sample, Explore, Modify, Model and Assess. The data mining process starts by taking a representative sample of the target population, with which a confidence level is associated. This sample is then explored and analyzed using visualization and statistical tools in order to obtain a set of significant variables that will become the input for a selected model. The selected model is then analyzed; the goal of this step is to determine relationships among the variables. In this phase, both statistical methods (e.g. discriminant analysis, clustering, and regression analysis) and data-oriented methods (e.g. neural networks, decision trees, association rules) can be used. The final phase in this process consists of evaluating the model and comparing it with different statistical methods and samples. On the other hand, Clementine proposes CATs [29] (Clementine Application Templates), application-specific libraries that follow the CRISP-DM standard, with each CAT stream assigned to a CRISP-DM phase.
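A minimal end-to-end sketch of the five SEMMA phases on synthetic data (the variables, the correlation-based variable selection and the decision tree are illustrative assumptions, not part of the SAS tooling):

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Sample: take a representative sample of the (here synthetic) target population
X, y = make_classification(n_samples=10000, n_features=8, random_state=0)
data = pd.DataFrame(X, columns=[f"x{i}" for i in range(8)])
data["target"] = y
sample = data.sample(frac=0.2, random_state=0)

# Explore: use statistics and visualization to find significant variables
print(sample.describe())
significant = sample.corr()["target"].abs().sort_values(ascending=False)[1:4].index

# Modify: keep the selected variables as the input for the chosen model
X_in, y_in = sample[significant], sample["target"]

# Model: apply a data-oriented method (a decision tree, as one example)
model = DecisionTreeClassifier(max_depth=5, random_state=0)

# Assess: evaluate the model, e.g. by cross-validated accuracy
print(cross_val_score(model, X_in, y_in, cv=5).mean())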
All of the above models depend heavily on the knowledge of the analysts (business and domain experts, data miners). There seems to be a need for an intermediate level of conceptualization that can provide an interface between the experts and the clients.
According to Grossman et al. [12] “although efforts have been done to ho-
mogenize terminology and concepts among standards more work is required”.
A framework for developing a unified model for data mining is proposed in [10]. The goal of the model is to provide a uniform data structure for all data mining patterns, together with operators to manipulate them. The model is designed under a three-view architecture (process view, model view and data view) that includes a process model and data views. The model view contains a set