JDM distinguishes between data that is prepared and data that is
unprepared. Data miners may specify that their data is already
prepared, perhaps through various extraction, transformation, and load
(ETL) tools, and that the data mining tool should not transform it
further. For example, if a user has already normalized a data attribute
(say, the range of attribute age between 10 and 90 has been mapped
to values between 0 and 1), the data mining tool typically should not
normalize it again. Alternatively, users may specify that some data
attributes are unprepared, meaning that the tool should perform
transformations it deems appropriate. JDM 2.0 further extends
support for data preparation by including a framework and an explicit
interface for performing common data mining transformations.
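
To make the prepared/unprepared distinction concrete, the following is a
minimal sketch in plain Java. It does not use the JDM API: the class
PreparationSketch, the enum Preparation, and the methods minMax and
forMining are hypothetical names introduced only for illustration. It
assumes min-max normalization as the transformation and shows why a tool
should leave an attribute alone once the user has declared it prepared.

```java
import java.util.Arrays;

/** Illustrative only: none of these names come from the JDM API. */
final class PreparationSketch {

    /** Hypothetical flag: has the user declared the attribute prepared? */
    enum Preparation { PREPARED, UNPREPARED }

    /** Min-max normalization: maps [min, max] linearly onto [0, 1]. */
    static double minMax(double value, double min, double max) {
        if (max <= min) {
            throw new IllegalArgumentException("max must be greater than min");
        }
        return (value - min) / (max - min);
    }

    /**
     * Returns the values a tool would mine. Prepared data is copied through
     * unchanged; unprepared data is normalized with the given range.
     */
    static double[] forMining(double[] values, Preparation status,
                              double min, double max) {
        if (status == Preparation.PREPARED) {
            return values.clone();
        }
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = minMax(values[i], min, max);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] rawAges = {10, 50, 90};

        // Unprepared: the tool normalizes age from [10, 90] to [0, 1].
        double[] once = forMining(rawAges, Preparation.UNPREPARED, 10, 90);
        System.out.println(Arrays.toString(once)); // [0.0, 0.5, 1.0]

        // Prepared: the already normalized values pass through untouched.
        double[] kept = forMining(once, Preparation.PREPARED, 10, 90);
        System.out.println(Arrays.toString(kept)); // [0.0, 0.5, 1.0]

        // Mislabeling prepared data as unprepared normalizes it a second
        // time and pushes the values outside the intended [0, 1] range.
        double[] twice = forMining(once, Preparation.UNPREPARED, 10, 90);
        System.out.println(Arrays.toString(twice));
    }
}
```

In the last call every value comes out negative, which is the kind of
silent distortion the prepared flag is meant to prevent.
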
Once a dataset is sufficiently prepared, the modeling phase begins.
Practitioners often consider this phase the “fun part.” Here, the user
gets to specify settings for mining functions, and if more control is
desired, the user can further select algorithms and their specific
settings for building models. These settings can be automatically tuned
by the data mining tool, or tuned explicitly by the user. Since there
are many possible algorithms or techniques for a given problem,
users may try several to determine which produces the best result.
Some mining algorithms may have specific data preparation
requirements. As such, users may switch back and forth between the
modeling and data preparation phases.
Also included in the modeling phase is model assessment.
Normally, a data mining tool will produce some model for almost any
data thrown at it, whether or not the data contains any meaningful
knowledge or patterns. To safeguard against this, users can test
supervised models, that is, those supporting classification and
regression. For unsupervised models, such as association and clustering,
users can inspect the models to determine whether the results are
meaningful. For example, are the clusters defined in a clustering model
helpful in understanding customer segments, or are these segments
different enough to develop a marketing strategy around them? We
explore the details of the modeling phase further in Section 3.3.
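
One simple way to test a supervised classification model is to compare
its accuracy on held-out test data with that of a trivial baseline that
always predicts the most frequent class. The sketch below illustrates
this idea in plain Java. It is not the JDM test interface: the class
AssessmentSketch and its methods are hypothetical, and accuracy is
assumed as the test metric purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: none of these names come from the JDM API. */
final class AssessmentSketch {

    /** Fraction of test cases whose predicted label equals the actual one. */
    static double accuracy(String[] actual, String[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("need equal-length, non-empty arrays");
        }
        int correct = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i].equals(predicted[i])) {
                correct++;
            }
        }
        return (double) correct / actual.length;
    }

    /** Accuracy obtained by always predicting the most frequent actual label. */
    static double majorityBaseline(String[] actual) {
        if (actual.length == 0) {
            throw new IllegalArgumentException("need at least one test case");
        }
        Map<String, Integer> counts = new HashMap<>();
        int best = 0;
        for (String label : actual) {
            // merge returns the updated count for this label
            best = Math.max(best, counts.merge(label, 1, Integer::sum));
        }
        return (double) best / actual.length;
    }

    /** True only if the model is strictly better than the trivial baseline. */
    static boolean beatsBaseline(String[] actual, String[] predicted) {
        return accuracy(actual, predicted) > majorityBaseline(actual);
    }

    public static void main(String[] args) {
        String[] actual    = {"yes", "no", "no", "no", "yes", "no"};
        String[] predicted = {"yes", "no", "no", "no", "no",  "no"};

        System.out.println("model accuracy:    " + accuracy(actual, predicted));
        System.out.println("majority baseline: " + majorityBaseline(actual));
        System.out.println("beats baseline:    " + beatsBaseline(actual, predicted));
    }
}
```

A model that merely matches the baseline, for example one that predicts
"no" for every case here, has learned nothing useful from the data even
though the tool was able to build it.
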
JDM provides extensive support for the modeling phase. Users who are
new to data mining can specify problems at the
mining function level. In this case, the data mining tool is responsible
for selecting an appropriate algorithm and corresponding algorithm