do not know whether an unsupervised model is correct or not, only
whether it is useful, as judged through manual inspection.
The notion of testing a supervised model is simple. Take a set of
cases with known outcomes and apply the model to those cases to
generate predictions. Compare the known, or actual, outcomes with
the predicted outcomes. Simplistically, the more predictions the model
gets correct, the more accurate the model. However, there are a variety
of test metrics that can be used to understand the goodness of a model.
Metrics such as confusion matrix, lift, and ROC for classification, and
various error metrics for regression are explored in Chapter 7.
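As a rough, self-contained illustration of this comparison step (plain Java written for this example, not the JDM test interface), a confusion matrix and an overall accuracy figure can be tallied from the actual and predicted class labels:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch (not the JDM API): compare actual and predicted class
// labels to produce a confusion matrix and an overall accuracy value.
public class SimpleModelTest {

    // Counts of "actual->predicted" label pairs, e.g. "yes->no" maps to 12.
    public static Map<String, Integer> confusionMatrix(List<String> actual,
                                                       List<String> predicted) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < actual.size(); i++) {
            counts.merge(actual.get(i) + "->" + predicted.get(i), 1, Integer::sum);
        }
        return counts;
    }

    // Fraction of predictions that match the known outcome.
    public static double accuracy(List<String> actual, List<String> predicted) {
        int correct = 0;
        for (int i = 0; i < actual.size(); i++) {
            if (actual.get(i).equals(predicted.get(i))) correct++;
        }
        return (double) correct / actual.size();
    }
}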
To test a model, we use what is called a held-aside or test dataset.
When building a model, the original data with known outcomes can
be split randomly into two sets, one that is used for model building,
and another that is used for testing the model. A typical split ratio is
70 percent for build and 30 percent for test, but this often depends on
the amount of data available. If too few cases are available for building,
reserving any records for test may lessen the accuracy of the
model. There is a technique known as cross validation for addressing
this situation [Moore 2006]. However, we do not discuss it further
here as cross validation is not currently defined in JDM.
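The random split itself can be sketched in plain Java as follows; the generic record type, the seed parameter, and the method names are choices made for this illustration, not part of JDM:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of the random build/test split; the 70/30 ratio is
// supplied by the caller, e.g. split(records, 0.70, 42L).
public class DatasetSplitter {

    public static <T> List<List<T>> split(List<T> records, double buildFraction, long seed) {
        List<T> shuffled = new ArrayList<>(records);
        Collections.shuffle(shuffled, new Random(seed));   // randomize record order

        int buildSize = (int) Math.round(shuffled.size() * buildFraction);
        List<T> buildSet = new ArrayList<>(shuffled.subList(0, buildSize));
        List<T> testSet  = new ArrayList<>(shuffled.subList(buildSize, shuffled.size()));

        List<List<T>> result = new ArrayList<>();
        result.add(buildSet);   // element 0: build set
        result.add(testSet);    // element 1: test set
        return result;
    }
}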
Just as in model apply, the same transformations must be applied
to the test data before providing it to the model. The model test
phase then compares the predictions with the known outcomes, and
from this comparison the test metrics are computed.
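For example, if a numeric attribute was z-score normalized during model build, the test data must be normalized using the mean and standard deviation computed from the build data, not statistics recomputed from the test data. A minimal sketch, with class and field names invented for illustration:

// Hypothetical sketch: a z-score normalization whose mean and standard
// deviation were computed on the build data and are simply reused here,
// rather than being recomputed from the test data.
public class ZScoreTransform {

    private final double buildMean;
    private final double buildStdDev;

    public ZScoreTransform(double buildMean, double buildStdDev) {
        this.buildMean = buildMean;
        this.buildStdDev = buildStdDev;
    }

    // Transform a test-data value exactly as build-data values were transformed.
    public double apply(double value) {
        return (value - buildMean) / buildStdDev;
    }
}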
Figure 3-9  Data mining model test process. (The held-aside test dataset, with known target values, is transformed and prepared using the same build-data transformations and statistics; the model is then tested against the transformed data to produce a confusion matrix, lift result, error metrics, or ROC.)