Data Mining Modeling, Analysis, and Scoring Processes
In Section 3.1.4, we discussed the CRISP-DM modeling phase at a
relatively high level. In this section, we explore the modeling process
in more detail, as well as process details for assessing supervised
model quality and applying models to new data. In JDM, we characterize
these activities as data mining tasks.
In model building, we start with a dataset—a collection of cases—
where each case typically corresponds to a record and has a set of
attribute values. A case can be data we have collected on a customer,
house, disease, or anything that we wish to understand better
through data mining. The amount of data required for mining varies
depending on the algorithm and the nature of the problem. For
example, in a clinical trial to assess health improvement, there may
be only 200 cases, one for each participating patient. On the other
hand, a company may have a database of 10 million customers and
want to segment these customers using a clustering algorithm. Similarly,
some problems may have very few attributes, such as observable
traits of mushrooms, while others may have thousands of
attributes, such as the output of a microarray chip exploring the
To extract knowledge or patterns from data, we begin with a
dataset, called the build data, as illustrated in Figure 3-7. Depending
on requirements of the data mining engine (DME) or the problem to be
solved, the data may be sampled and transformed, producing a
transformed dataset ready for model building.
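
The sampling and transformation step can be sketched in a few lines of Python. This is an illustration only, not the JDM API: the function names sample_cases and min_max_normalize, the customer attributes, and the choice of min-max normalization as the transformation are all assumptions made for the example. Which transformations are actually required depends on the DME and the algorithm.

```python
import random


def sample_cases(cases, fraction, seed=0):
    """Return a simple random sample of the build data, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    k = max(1, round(len(cases) * fraction))
    return rng.sample(cases, k)


def min_max_normalize(cases, attribute):
    """Return copies of the cases with one numeric attribute rescaled to [0, 1]."""
    values = [case[attribute] for case in cases]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant attribute
    return [{**case, attribute: (case[attribute] - lo) / span} for case in cases]


# Hypothetical build data: one case (record) per customer.
build_data = [
    {"id": i, "age": 20 + i, "income": 30000 + 1500 * i}
    for i in range(10)
]

sampled = sample_cases(build_data, fraction=0.5)
transformed = min_max_normalize(sampled, "income")
```

Here transformed plays the role of the transformed dataset that is handed to model building; the original build_data is left unchanged.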
The process of model building requires not only the data, but also
a group of settings that tell the DME what type of model to build, for
example, a classification model with a particular target attribute. The
settings may also specify which algorithm to use, among other options.
The output of this process is a model: a compact representation of
the knowledge or patterns contained in the data. Depending on the
mining function and algorithm, the model can then be used to make
predictions, or inspected to understand the knowledge or patterns
found in the data.
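
The relationship among build data, settings, and the resulting model can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the JDM API: BuildSettings, Model, and build_model are hypothetical names, and the two toy algorithms ("majority_class" and a one-attribute rule learner called "one_rule") stand in for whatever algorithms a real DME provides. The mushroom data is invented for the example.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildSettings:
    """Hypothetical settings object; the names are illustrative only."""
    function: str                      # type of model, e.g. "classification"
    target: str                        # name of the target attribute
    algorithm: str = "majority_class"  # which algorithm to use
    predictor: str = ""                # used only by the "one_rule" algorithm


@dataclass
class Model:
    """Compact representation of the patterns found in the build data."""
    settings: BuildSettings
    default: str   # most frequent target value overall
    rules: dict    # predictor value -> predicted target value

    def predict(self, case):
        value = case.get(self.settings.predictor)
        return self.rules.get(value, self.default)


def build_model(cases, settings):
    """Build a model from the cases according to the settings."""
    if settings.function != "classification":
        raise ValueError("this sketch only handles classification")
    targets = [case[settings.target] for case in cases]
    default = Counter(targets).most_common(1)[0][0]
    rules = {}
    if settings.algorithm == "one_rule":
        # For each predictor value, remember its most frequent target value.
        by_value = defaultdict(Counter)
        for case in cases:
            by_value[case[settings.predictor]][case[settings.target]] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
    elif settings.algorithm != "majority_class":
        raise ValueError(f"unknown algorithm: {settings.algorithm}")
    return Model(settings, default, rules)


# Invented build data: observable traits of mushrooms.
build_data = [
    {"odor": "foul", "edible": "no"},
    {"odor": "foul", "edible": "no"},
    {"odor": "none", "edible": "yes"},
    {"odor": "none", "edible": "yes"},
    {"odor": "none", "edible": "yes"},
    {"odor": "almond", "edible": "yes"},
]

settings = BuildSettings("classification", target="edible",
                         algorithm="one_rule", predictor="odor")
model = build_model(build_data, settings)
```

The model can then be used in both ways the text describes: model.predict({"odor": "foul"}) makes a prediction for a new case, while model.rules can be inspected directly to see which patterns were found.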