do not know whether an unsupervised model is correct, only whether
it is useful, as judged through manual inspection.
The notion of testing a supervised model is simple. Take a set of
cases with known outcomes and apply the model to those cases to
generate predictions. Compare the known, or actual, outcomes with
the predicted outcomes. Simplistically, the more predictions the model
gets correct, the more accurate the model. However, there are a variety
of test metrics that can be used to understand the goodness of a model.
Metrics such as confusion matrix, lift, and ROC for classification, and
various error metrics for regression are explored in Chapter 7.
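To make the comparison of actual and predicted outcomes concrete, the following sketch computes a binary confusion matrix and an accuracy figure in plain Java. This is an illustration only; the class and method names are hypothetical and are not part of the JDM API, which defines its own test metric objects.

```java
// Hypothetical illustration (not the JDM API): computing a confusion
// matrix and accuracy from actual and predicted binary outcomes.
public class ConfusionMatrixDemo {

    /** Returns {{TP, FN}, {FP, TN}}, treating 1 as the positive class. */
    static int[][] confusionMatrix(int[] actual, int[] predicted) {
        int[][] m = new int[2][2];
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] == 1 && predicted[i] == 1) m[0][0]++; // true positive
            else if (actual[i] == 1)                 m[0][1]++; // false negative
            else if (predicted[i] == 1)              m[1][0]++; // false positive
            else                                     m[1][1]++; // true negative
        }
        return m;
    }

    /** Accuracy is the fraction of predictions the model got correct. */
    static double accuracy(int[][] m) {
        int correct = m[0][0] + m[1][1];
        int total = m[0][0] + m[0][1] + m[1][0] + m[1][1];
        return (double) correct / total;
    }

    public static void main(String[] args) {
        int[] actual    = {1, 1, 0, 0, 1, 0};
        int[] predicted = {1, 0, 0, 1, 1, 0};
        int[][] m = confusionMatrix(actual, predicted);
        System.out.println("Accuracy: " + accuracy(m)); // 4 of 6 correct
    }
}
```

The confusion matrix is the more informative of the two: the single accuracy number above hides whether the errors were false positives or false negatives.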
To test a model, we use what is called a held-aside or test dataset.
When building a model, the original data with known outcomes can
be split randomly into two sets, one that is used for model building,
and another that is used for testing the model. A typical split ratio is
70 percent for build and 30 percent for test, but this often depends on
the amount of data available. If too few cases are available for building,
reserving any records for test may lessen the accuracy of the
model. There is a technique known as cross validation for addressing
this situation [Moore 2006]. However, we do not discuss it further
here as cross validation is not currently defined in JDM.
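The random 70/30 split described above can be sketched in a few lines of plain Java. The class and method names here are hypothetical, chosen only for illustration; a fixed random seed is assumed so the split is repeatable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch (not the JDM API): randomly splitting records with
// known outcomes into a 70 percent build set and a 30 percent test set.
public class SplitDemo {

    /** Shuffles the records, then cuts off the first buildFraction for building. */
    static List<List<String>> split(List<String> records,
                                    double buildFraction, long seed) {
        List<String> shuffled = new ArrayList<>(records);
        Collections.shuffle(shuffled, new Random(seed)); // fixed seed: repeatable split
        int cut = (int) Math.round(shuffled.size() * buildFraction);
        List<List<String>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));               // build set
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size()))); // test set
        return parts;
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 10; i++) records.add("case-" + i);
        List<List<String>> parts = split(records, 0.70, 42L);
        System.out.println("build=" + parts.get(0).size()
                + " test=" + parts.get(1).size()); // build=7 test=3
    }
}
```

Shuffling before cutting matters: if the original data is ordered, say by date or by outcome, a straight cut would give build and test sets with different distributions.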
Just as in model apply, the same transformations must be applied
to the test data before providing it to the model. The model test
phase then compares the predictions with the known outcomes; from
these comparisons, the test metrics are computed.
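The point about reusing transformations can be illustrated with a simple min-max normalization: the scaling parameters are learned from the build data once and then applied unchanged to the test data. This is a hypothetical plain-Java sketch, not JDM transformation objects.

```java
// Hypothetical sketch: normalization parameters are captured from the
// build data and reused, unchanged, on the test data before scoring.
public class TransformDemo {
    static double min, max; // parameters learned during model build

    /** Learns the scaling parameters from the build data only. */
    static void fit(double[] buildColumn) {
        min = Double.POSITIVE_INFINITY;
        max = Double.NEGATIVE_INFINITY;
        for (double v : buildColumn) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
    }

    /** Applies the build-time min-max scaling to any column, e.g. test data. */
    static double[] transform(double[] column) {
        double[] out = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            out[i] = (column[i] - min) / (max - min);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] build = {10, 20, 30, 40};
        double[] test  = {15, 35};
        fit(build);                      // parameters come from the build data only
        double[] scaled = transform(test);
        System.out.println(scaled[0] + " " + scaled[1]);
    }
}
```

Recomputing min and max on the test data itself would silently give the model inputs on a different scale than it was built with, which is exactly the mistake this discipline avoids.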
Figure: Data mining model test process.