Data Mining Process - Java Data Mining: Strategy, Standard, and Practice

3.3 Data Mining Modeling, Analysis, and Scoring Processes

In Section 3.1.4, we discussed the CRISP-DM modeling phase at a relatively high level. In this section, we explore the modeling process in more detail, as well as process details for assessing supervised model quality and applying models to new data. In JDM, we characterize these activities as data mining tasks.

3.3.1 Model Building

In model building, we start with a dataset (a collection of cases), where each case typically corresponds to a record and has a set of attribute values. A case can be data we have collected on a customer, house, disease, or anything that we wish to understand better through data mining. The amount of data required for mining varies depending on the algorithm and the nature of the problem. For example, in a clinical trial to assess health improvement, there may be only 200 cases, one for each participating patient. On the other hand, a company may have a database of 10 million customers and want to segment these customers using a clustering algorithm. Similarly, some problems may have very few attributes, such as observable traits of mushrooms, while others may have thousands of attributes, such as the output of a microarray chip exploring the human genome.
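
To make the terms concrete, the following is a minimal sketch of a dataset as a list of cases, each holding named attribute values. The class names (DatasetSketch, Case), the attribute names, and the values are made up for illustration; they are not part of the JDM API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: these types are not part of the JDM API.
public class DatasetSketch {

    /** One case: an identifier plus a set of named attribute values. */
    public static final class Case {
        public final String caseId;
        public final Map<String, Object> attributes = new LinkedHashMap<>();

        public Case(String caseId) {
            this.caseId = caseId;
        }

        /** Adds one attribute value and returns this case so calls can be chained. */
        public Case with(String name, Object value) {
            attributes.put(name, value);
            return this;
        }
    }

    /** Builds a tiny, made-up dataset of two customer cases. */
    public static List<Case> exampleCustomers() {
        List<Case> dataset = new ArrayList<>();
        dataset.add(new Case("c1").with("age", 34).with("income", 52000.0).with("region", "west"));
        dataset.add(new Case("c2").with("age", 58).with("income", 87000.0).with("region", "east"));
        return dataset;
    }

    public static void main(String[] args) {
        for (Case c : exampleCustomers()) {
            System.out.println(c.caseId + " -> " + c.attributes);
        }
    }
}
```

A real dataset would normally live in a table or file rather than in memory; the sketch only shows the case-and-attribute shape the text describes.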

To extract knowledge or patterns from data, we begin with a dataset, called the build data, as illustrated in Figure 3-7. Depending on requirements of the data mining engine (DME) or the problem to be solved, the data may be sampled and transformed, producing a transformed dataset ready for model building.
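
As a rough sketch of the sampling and transformation step, the code below draws a seeded random sample from the build data and then min-max scales one numeric column. Both choices are assumptions made for illustration: the text does not say which sampling scheme or transformation a given DME requires, and the class and method names (PrepareSketch, sample, minMaxScale) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration: not the JDM API and not any specific DME's behavior.
public class PrepareSketch {

    /** Draws a simple random sample of up to n rows without replacement (seeded for repeatability). */
    public static List<double[]> sample(List<double[]> rows, int n, long seed) {
        List<double[]> copy = new ArrayList<>(rows);
        Collections.shuffle(copy, new Random(seed));
        return new ArrayList<>(copy.subList(0, Math.min(n, copy.size())));
    }

    /** Min-max scales one column into [0, 1]; returns new rows and leaves the input untouched. */
    public static List<double[]> minMaxScale(List<double[]> rows, int column) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double[] r : rows) {
            min = Math.min(min, r[column]);
            max = Math.max(max, r[column]);
        }
        double range = max - min;
        List<double[]> out = new ArrayList<>();
        for (double[] r : rows) {
            double[] t = r.clone();
            // A constant column has no spread; map it to 0 rather than divide by zero.
            t[column] = (range == 0.0) ? 0.0 : (r[column] - min) / range;
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        List<double[]> buildData = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            // Made-up rows of {age, income}.
            buildData.add(new double[] {20 + i * 5, 30000 + i * 4000});
        }
        List<double[]> prepared = minMaxScale(sample(buildData, 4, 42L), 1);
        for (double[] r : prepared) {
            System.out.println(r[0] + ", " + r[1]);
        }
    }
}
```

Sampling before scaling means the minimum and maximum come from the sample rather than the full build data; whether that is appropriate depends on the problem.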

The process of model building requires not only the data, but also a group of settings that tell the DME what type of model to build, for example, a classification model with a particular target attribute. The settings may also specify which algorithm to use, among other options. The output of this process is a model: a compact representation of the knowledge or patterns contained in the data. Depending on the mining function and algorithm, the model can then be used to make predictions, or inspected to understand the knowledge or patterns found in the data.
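
The relationship between build data, settings, and the resulting model can be sketched as follows. This is not the JDM API: BuildSketch, BuildSettings, Model, and the "majority-class" algorithm name are hypothetical, and the algorithm is a deliberately trivial baseline chosen only to keep the example short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of "settings + build data -> model"; these are not the JDM interfaces.
public class BuildSketch {

    /** Settings tell the engine what to build: here, a target attribute and an algorithm name. */
    public static final class BuildSettings {
        public final String targetAttribute;
        public final String algorithm;

        public BuildSettings(String targetAttribute, String algorithm) {
            this.targetAttribute = targetAttribute;
            this.algorithm = algorithm;
        }
    }

    /** The model: a compact summary of the build data (per-class counts and the majority class). */
    public static final class Model {
        public final Map<String, Integer> classCounts;
        public final String majorityClass;

        Model(Map<String, Integer> classCounts, String majorityClass) {
            this.classCounts = classCounts;
            this.majorityClass = majorityClass;
        }

        /** Prediction use: this baseline ignores the new case and always answers the majority class. */
        public String predict(Map<String, String> newCase) {
            return majorityClass;
        }
    }

    /** Builds a model from the build data according to the settings. */
    public static Model build(List<Map<String, String>> buildData, BuildSettings settings) {
        if (!"majority-class".equals(settings.algorithm)) {
            throw new IllegalArgumentException("Unsupported algorithm: " + settings.algorithm);
        }
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        for (Map<String, String> c : buildData) {
            String label = c.get(settings.targetAttribute);
            if (label == null) {
                continue; // skip cases with no target value
            }
            int n = counts.merge(label, 1, Integer::sum);
            if (best == null || n > counts.get(best)) {
                best = label;
            }
        }
        return new Model(counts, best);
    }

    public static void main(String[] args) {
        List<Map<String, String>> buildData = new ArrayList<>();
        for (String label : new String[] {"yes", "no", "yes"}) { // made-up target values
            Map<String, String> c = new HashMap<>();
            c.put("responded", label);
            buildData.add(c);
        }
        Model model = build(buildData, new BuildSettings("responded", "majority-class"));
        System.out.println("Inspect: " + model.classCounts);
        System.out.println("Predict: " + model.predict(new HashMap<String, String>()));
    }
}
```

The two calls in main correspond to the two uses named in the text: inspecting the model to see what it captured, and applying it to make a prediction.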
