8.5 Model Evaluation and Selection
Now that you may have built a classification model, there may be many questions going
through your mind. For example, suppose you used data from previous sales to build
a classifier to predict customer purchasing behavior. You would like an estimate of how
accurately the classifier can predict the purchasing behavior of future customers, that
is, future customer data on which the classifier has not been trained. You may even
have tried different methods to build more than one classifier and now wish to compare
their accuracy. But what is accuracy? How can we estimate it? Are some measures of a
classifier's accuracy more appropriate than others? How can we obtain a reliable accuracy
estimate? These questions are addressed in this section.
Section 8.5.1 describes various evaluation metrics for the predictive accuracy
of a classifier. Holdout and random subsampling (Section 8.5.2), cross-validation
(Section 8.5.3), and bootstrap methods (Section 8.5.4) are common techniques for
assessing accuracy, based on randomly sampled partitions of the given data. What if
we have more than one classifier and want to choose the “best” one? This is referred
to as model selection (i.e., choosing one classifier over another). The last two sections
address this issue. Section 8.5.5 discusses how to use tests of statistical significance
to assess whether the difference in accuracy between two classifiers is due to chance.
Section 8.5.6 presents how to compare classifiers based on cost-benefit and receiver
operating characteristic (ROC) curves.
8.5.1 Metrics for Evaluating Classifier Performance
This section presents measures for assessing how good or how “accurate” your classifier
is at predicting the class label of tuples. We will consider the case where the class
tuples are more or less evenly distributed, as well as the case where classes are unbalanced (e.g.,
where an important class of interest is rare, such as in medical tests). The classifier
evaluation measures presented in this section are summarized in Figure 8.13. They include
accuracy (also known as recognition rate), sensitivity (or recall), specificity, precision,
F1, and Fβ. Note that although accuracy is a specific measure, the word “accuracy” is
also used as a general term to refer to a classifier's predictive abilities.
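To make these measures concrete before their formal definitions, here is a minimal sketch (not from the text) that computes them from the four confusion-matrix counts, that is, the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), which are defined shortly. The function name and the example counts are invented for illustration.

    # Minimal sketch: evaluation measures from assumed confusion-matrix counts.
    def evaluation_measures(TP, TN, FP, FN, beta=1.0):
        P = TP + FN                       # number of positive tuples
        N = TN + FP                       # number of negative tuples
        accuracy    = (TP + TN) / (P + N)
        sensitivity = TP / P              # recall, true positive rate
        specificity = TN / N              # true negative rate
        precision   = TP / (TP + FP)
        f1    = 2 * precision * sensitivity / (precision + sensitivity)
        fbeta = ((1 + beta**2) * precision * sensitivity
                 / (beta**2 * precision + sensitivity))
        return {"accuracy": accuracy, "sensitivity": sensitivity,
                "specificity": specificity, "precision": precision,
                "F1": f1, "Fbeta": fbeta}

    # Hypothetical counts, for illustration only.
    print(evaluation_measures(TP=90, TN=9560, FP=140, FN=210))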
Using training data to derive a classifier and then estimating the accuracy of the
resulting learned model can result in misleadingly overoptimistic estimates due to
overspecialization of the learning algorithm to the data. (We will say more on this in a
moment!) Instead, it is better to measure the classifier's accuracy on a test set consisting
of class-labeled tuples that were not used to train the model.
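To see the point in practice, the following sketch (assuming the scikit-learn library, which the text does not use) trains a decision tree and reports its accuracy both on the training tuples and on a held-out test set of class-labeled tuples that were not used to train the model; the training-set figure is typically the overoptimistic one.

    # Sketch: training-set accuracy vs. accuracy on a held-out test set.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # Hold out one third of the class-labeled tuples; never used for training.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, random_state=0)

    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

    print("accuracy on training data:", clf.score(X_train, y_train))  # typically near 1.0
    print("accuracy on unseen test data:", clf.score(X_test, y_test)) # a more honest estimate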
Before we discuss the various measures, we need to become comfortable with
some terminology. Recall that we can talk in terms of positive tuples (tuples of the
main class of interest) and negative tuples (all other tuples). 6 Given two classes, for
example, the positive tuples may be buys_computer = yes while the negative tuples are buys_computer = no.
6 In the machine learning and pattern recognition literature, these are referred to as positive samples and
negative samples, respectively.
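As a small illustration of this terminology (a sketch with invented labels, not an example from the text), the following code tallies, for the positive class buys_computer = yes, how a classifier's predictions on test tuples split into the four counts used by the measures above.

    # Sketch: tallying TP, TN, FP, FN for the positive class buys_computer = yes.
    actual    = ["yes", "yes", "no", "no", "yes", "no", "no", "yes"]  # true labels (invented)
    predicted = ["yes", "no",  "no", "yes", "yes", "no", "no", "no"]  # classifier output (invented)

    TP = sum(1 for a, p in zip(actual, predicted) if a == "yes" and p == "yes")
    TN = sum(1 for a, p in zip(actual, predicted) if a == "no"  and p == "no")
    FP = sum(1 for a, p in zip(actual, predicted) if a == "no"  and p == "yes")
    FN = sum(1 for a, p in zip(actual, predicted) if a == "yes" and p == "no")

    print("TP =", TP, "TN =", TN, "FP =", FP, "FN =", FN)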