Data Mining Process - Java Data Mining: Strategy, Standard, and Practice


do not know whether an unsupervised model is correct or not, only if

it is useful through manual inspection.

The notion of testing a supervised model is simple. Take a set of

cases with known outcomes and apply the model to those cases to

generate predictions. Compare the known, or actual, outcomes with

the predicted outcomes. Put simply, the more predictions the model

gets correct, the more accurate the model. However, a variety of

test metrics can be used to assess the quality of a model.

Metrics such as confusion matrix, lift, and ROC for classification, and

various error metrics for regression are explored in Chapter 7.
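The comparison of actual and predicted outcomes can be sketched in plain Java. This is a minimal illustration, not the JDM API: the class name `ModelTestSketch`, its methods, and the integer class labels are all hypothetical, and the sample data is made up for the example.

```java
import java.util.Arrays;

// Hypothetical helper, not part of JDM: compares actual and predicted class labels.
class ModelTestSketch {

    // Fraction of cases whose predicted outcome equals the actual outcome.
    static double accuracy(int[] actual, int[] predicted) {
        if (actual.length != predicted.length || actual.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        int correct = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] == predicted[i]) {
                correct++;
            }
        }
        return (double) correct / actual.length;
    }

    // matrix[a][p] counts the cases with actual class a that were predicted as class p.
    // Labels are assumed to be integers in the range 0..numClasses-1.
    static int[][] confusionMatrix(int[] actual, int[] predicted, int numClasses) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must be of equal length");
        }
        int[][] matrix = new int[numClasses][numClasses];
        for (int i = 0; i < actual.length; i++) {
            matrix[actual[i]][predicted[i]]++;
        }
        return matrix;
    }

    public static void main(String[] args) {
        int[] actual    = {1, 0, 1, 1, 0, 0, 1, 0};
        int[] predicted = {1, 0, 0, 1, 0, 1, 1, 0};
        System.out.println(accuracy(actual, predicted));                                // prints 0.75
        System.out.println(Arrays.deepToString(confusionMatrix(actual, predicted, 2))); // prints [[3, 1], [1, 3]]
    }
}
```

Accuracy collapses the confusion matrix into a single number (the sum of its diagonal divided by the number of cases), which is one reason the matrix itself is often more informative.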

To test a model, we use what is called a held-aside or test dataset.

When building a model, the original data with known outcomes can

be split randomly into two sets, one that is used for model building,

and another that is used for testing the model. A typical split ratio is

70 percent for build and 30 percent for test, but this often depends on

the amount of data available. If too few cases are available for

building, reserving any records for test may lessen the accuracy of the

model. There is a technique known as cross validation for addressing

this situation [Moore 2006]. However, we do not discuss it further

here as cross validation is not currently defined in JDM.
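The random build/test split described above might look like the following in plain Java. This is a sketch under stated assumptions rather than a JDM facility: `HoldoutSplitSketch`, `split`, and the `seed` parameter are hypothetical names, and rounding the build size to the nearest whole case is one choice among several.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of JDM: randomly splits cases into build and test sets.
class HoldoutSplitSketch {

    // Returns a two-element list: index 0 is the build set, index 1 is the test set.
    // The input list is copied, so the caller's ordering is left untouched.
    static <T> List<List<T>> split(List<T> cases, double buildFraction, long seed) {
        if (buildFraction <= 0.0 || buildFraction >= 1.0) {
            throw new IllegalArgumentException("buildFraction must be between 0 and 1");
        }
        List<T> shuffled = new ArrayList<>(cases);
        Collections.shuffle(shuffled, new Random(seed)); // fixed seed makes the split repeatable
        int buildSize = (int) Math.round(shuffled.size() * buildFraction);
        List<T> build = new ArrayList<>(shuffled.subList(0, buildSize));
        List<T> test = new ArrayList<>(shuffled.subList(buildSize, shuffled.size()));
        return List.of(build, test);
    }

    public static void main(String[] args) {
        List<Integer> caseIds = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            caseIds.add(i);
        }
        List<List<Integer>> parts = split(caseIds, 0.7, 42L);
        System.out.println("build size: " + parts.get(0).size()); // prints build size: 7
        System.out.println("test size: " + parts.get(1).size());  // prints test size: 3
    }
}
```

Every case lands in exactly one of the two sets, so no test case has been seen during model building.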

Just as in model apply, the transformations used on the build data

must also be applied to the test data before it is provided to the

model. The model test phase then compares the predictions with the

known outcomes and computes the test metrics from that comparison.
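To illustrate reusing build-time transformations on test data, the sketch below uses min-max normalization as a stand-in example. The choice of transformation, the class name `MinMaxScalerSketch`, and its methods are assumptions made for illustration and are not taken from JDM.

```java
import java.util.Arrays;

// Hypothetical helper, not part of JDM: min-max normalization whose statistics
// come from the build data and are then reused unchanged on the test data.
class MinMaxScalerSketch {
    private final double min;
    private final double max;

    private MinMaxScalerSketch(double min, double max) {
        this.min = min;
        this.max = max;
    }

    // Computes the minimum and maximum from the build data only.
    static MinMaxScalerSketch fit(double[] buildValues) {
        if (buildValues.length == 0) {
            throw new IllegalArgumentException("build data must not be empty");
        }
        double min = buildValues[0];
        double max = buildValues[0];
        for (double v : buildValues) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return new MinMaxScalerSketch(min, max);
    }

    // Applies the stored build statistics; test values outside the build range
    // map outside [0, 1] because nothing is recomputed from the test data.
    double[] transform(double[] values) {
        double range = max - min;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (range == 0.0) ? 0.0 : (values[i] - min) / range;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] buildData = {20.0, 40.0, 60.0};
        double[] testData = {30.0, 70.0};
        MinMaxScalerSketch scaler = fit(buildData);
        System.out.println(Arrays.toString(scaler.transform(testData))); // prints [0.25, 1.25]
    }
}
```

Fitting a fresh scaler on the test data would instead map 30 and 70 to 0 and 1, so the model would see inputs on a different scale from the one it was built on.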

[Figure 3-9: Data mining model test process. The held-aside test dataset (with known target values) passes through Transform, Prepare Data, using the same build data transformations and statistics, to yield the transformed test data. Test Model takes this data and the model and produces a confusion matrix, lift result, and ROC, or error metrics.]
