8.5 Model Evaluation and Selection
Now that you may have built a classification model, there may be many questions going
through your mind. For example, suppose you used data from previous sales to build
a classifier to predict customer purchasing behavior. You would like an estimate of how
accurately the classifier can predict the purchasing behavior of future customers, that
is, future customer data on which the classifier has not been trained. You may even
have tried different methods to build more than one classifier and now wish to compare
their accuracy. But what is accuracy? How can we estimate it? Are some measures of a
classifier's accuracy more appropriate than others? How can we obtain a reliable accuracy
estimate? These questions are addressed in this section.
Section 8.5.1 describes various evaluation metrics for the predictive accuracy
of a classifier. Holdout and random subsampling (Section 8.5.2), cross-validation
(Section 8.5.3), and bootstrap methods (Section 8.5.4) are common techniques for
assessing accuracy, based on randomly sampled partitions of the given data. What if
we have more than one classifier and want to choose the “best” one? This is referred
to as model selection (i.e., choosing one classifier over another). The last two sections
address this issue. Section 8.5.5 discusses how to use tests of statistical significance
to assess whether the difference in accuracy between two classifiers is due to chance.
Section 8.5.6 presents how to compare classifiers based on cost-benefit and receiver
operating characteristic (ROC) curves.
8.5.1 Metrics for Evaluating Classifier Performance
This section presents measures for assessing how good or how “accurate” your classifier
is at predicting the class label of tuples. We will consider the case where the tuples
are more or less evenly distributed among the classes, as well as the case where classes
are unbalanced (e.g., where an important class of interest is rare, such as in medical
tests). The classifier evaluation measures presented in this section are summarized in
Figure 8.13. They include accuracy (also known as recognition rate), sensitivity (or
recall), specificity, precision, F_1, and F_β.
Note that although accuracy is a specific measure, the word “accuracy” is
also used as a general term to refer to a classifier's predictive abilities.
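As a rough illustration of the measures just listed, the following Python sketch computes them from four counts: true positives (tp), true negatives (tn), false positives (fp), and false negatives (fn). It assumes the standard confusion-matrix definitions of these measures, which this excerpt has not yet stated. The function name classifier_metrics, the dictionary keys, the convention of returning 0.0 for an undefined ratio, and the example counts are all hypothetical choices made for this sketch.

```python
def safe_divide(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero.

    Returning 0.0 for an undefined ratio is an assumption of this sketch.
    """
    return numerator / denominator if denominator else 0.0


def classifier_metrics(tp, tn, fp, fn, beta=1.0):
    """Compute common evaluation measures from confusion-matrix counts."""
    accuracy = safe_divide(tp + tn, tp + tn + fp + fn)  # recognition rate
    sensitivity = safe_divide(tp, tp + fn)              # recall, true positive rate
    specificity = safe_divide(tn, tn + fp)              # true negative rate
    precision = safe_divide(tp, tp + fp)
    # F_beta weights recall beta times as much as precision; beta = 1 gives F_1.
    b2 = beta ** 2
    f_beta = safe_divide((1 + b2) * precision * sensitivity,
                         b2 * precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_beta": f_beta,
    }


# Made-up counts for an unbalanced problem: 100 positive and 900 negative tuples.
metrics = classifier_metrics(tp=30, tn=880, fp=20, fn=70)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these made-up counts the accuracy is 0.91 while the sensitivity is only 0.30, which hints at why accuracy alone may be a poor guide when the class of interest is rare.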
Using training data to derive a classifier and then to estimate the accuracy of the
resulting learned model can result in misleading, overoptimistic estimates due to
overspecialization of the learning algorithm to the data. (We will say more on this in a
moment!) Instead, it is better to measure the classifier's accuracy on a test set consisting
of class-labeled tuples that were not used to train the model.
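The gap between the two kinds of estimate can be seen in a small experiment. The sketch below is only an illustration under stated assumptions: the data are synthetic (one numeric attribute, a class boundary at 0.5, and 20% of labels flipped at random), the classifier is a 1-nearest-neighbor rule chosen because it memorizes its training tuples, and all function names are hypothetical. None of these choices come from the text.

```python
import random


def make_data(n, rng, noise=0.2):
    """Generate n (x, label) tuples; each label is flipped with probability noise."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label
        data.append((x, label))
    return data


def predict(train, x):
    """Return the label of the training tuple whose x is closest (1-NN)."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]


def accuracy(train, data):
    """Fraction of tuples in data whose label the 1-NN rule predicts correctly."""
    correct = sum(1 for x, label in data if predict(train, x) == label)
    return correct / len(data)


rng = random.Random(0)                     # fixed seed so the run is repeatable
data = make_data(300, rng)
train, test = data[:200], data[200:]       # the test tuples are never used for training

train_acc = accuracy(train, train)         # estimate on the training data itself
test_acc = accuracy(train, test)           # estimate on unseen, class-labeled tuples
print(f"accuracy on training data: {train_acc:.3f}")
print(f"accuracy on test set:      {test_acc:.3f}")
```

Because each training tuple is its own nearest neighbor, the training-data estimate is perfect even though a fifth of the labels are noise; the test-set estimate is lower and is the more honest figure.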
Before we discuss the various measures, we need to become comfortable with
some terminology. Recall that we can talk in terms of positive tuples (tuples of the
main class of interest) and negative tuples (all other tuples). 6 Given two classes, for
example, the positive tuples may be buys_computer = yes while the negative tuples are
6 In the machine learning and pattern recognition literature, these are referred to as positive samples and
negative samples, respectively.