model. The cost associated with a false negative (such as incorrectly predicting that a
cancerous patient is not cancerous) is far greater than that of a false positive
(incorrectly yet conservatively labeling a noncancerous patient as cancerous). In such
cases, we can weight one type of error more heavily than the other by assigning a different cost to
each. These costs may consider the danger to the patient, financial costs of resulting
therapies, and other hospital costs. Similarly, the benefits associated with a true positive
decision may be different than those of a true negative. Up to now, to compute classifier
accuracy, we have assumed equal costs and essentially divided the sum of true positives
and true negatives by the total number of test tuples.
Alternatively, we can incorporate costs and benefits by instead computing the average
cost (or benefit) per decision. Other applications involving cost-benefit analysis include
loan application decisions and target marketing mailouts. For example, the cost of loaning
to a defaulter greatly exceeds that of the lost business incurred by denying a loan to a
nondefaulter. Similarly, in an application that tries to identify households that are likely
to respond to mailouts of certain promotional material, the cost of mailouts to numerous
households that do not respond may outweigh the cost of lost business from not
mailing to households that would have responded. Other costs to consider in the overall
analysis include the costs to collect the data and to develop the classification tool.
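To make the idea concrete, here is a minimal sketch of cost-sensitive evaluation; the confusion-matrix counts and the per-decision costs are invented for illustration and are not taken from the text.

```python
# Minimal sketch of cost-sensitive evaluation (all counts and costs are
# invented for illustration).
TP, FN = 90, 10       # positive tuples: correctly / incorrectly classified
FP, TN = 140, 9760    # negative tuples: incorrectly / correctly classified

# Hypothetical per-decision costs: a false negative (a missed cancer case)
# is assumed to be far costlier than a false positive (an unnecessary exam).
cost = {"TP": 0.0, "TN": 0.0, "FP": 10.0, "FN": 500.0}

total = TP * cost["TP"] + TN * cost["TN"] + FP * cost["FP"] + FN * cost["FN"]
n = TP + TN + FP + FN

print("accuracy (equal costs):   ", (TP + TN) / n)
print("average cost per decision:", total / n)
```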
Receiver operating characteristic (ROC) curves are a useful visual tool for comparing two
classification models. ROC curves come from signal detection theory, which was developed
during World War II for the analysis of radar images. An ROC curve for a given
model shows the trade-off between the true positive rate (TPR) and the false positive rate
(FPR).10 Given a test set and a model, TPR is the proportion of positive (or “yes”) tuples
that are correctly labeled by the model; FPR is the proportion of negative (or “no”)
tuples that are mislabeled as positive. Given that TP, FP, P, and N are the number of
true positive, false positive, positive, and negative tuples, respectively, from Section 8.5.1
we know that

TPR = TP / P,

which is sensitivity. Furthermore,

FPR = FP / N,

which is 1 − specificity.
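Expressed in code, the two rates follow directly from these definitions; the counts below are assumed values used only to show the computation.

```python
def tpr_fpr(TP, FP, P, N):
    """Return (sensitivity, 1 - specificity) from confusion-matrix counts."""
    return TP / P, FP / N

# Assumed counts: 90 of 100 positive tuples found, 140 of 9,900 negatives mislabeled.
tpr, fpr = tpr_fpr(TP=90, FP=140, P=100, N=9_900)
print(f"TPR = {tpr:.3f}, FPR = {fpr:.3f}")   # TPR = 0.900, FPR = 0.014
```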
For a two-class problem, an ROC curve allows us to visualize the trade-off between
the rate at which the model can accurately recognize positive cases versus the rate at
which it mistakenly identifies negative cases as positive for different portions of the test
set. Any increase in TPR occurs at the cost of an increase in FPR . The area under the
ROC curve is a measure of the accuracy of the model.
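As a rough illustration of how the area under a piecewise-linear ROC curve can be computed, the sketch below applies the trapezoidal rule to a handful of made-up (FPR, TPR) points.

```python
# Approximate the area under a piecewise-linear ROC curve with the
# trapezoidal rule; the (FPR, TPR) points are made up for illustration.
points = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.8), (0.6, 0.95), (1.0, 1.0)]

auc = 0.0
for (x0, y0), (x1, y1) in zip(points, points[1:]):
    auc += (x1 - x0) * (y0 + y1) / 2.0      # area of one trapezoid

print(f"approximate area under the ROC curve = {auc:.3f}")
# 1.0 would indicate a perfect model, 0.5 a model that guesses at random.
```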
To plot an ROC curve for a given classification model, M , the model must be able to
return a probability of the predicted class for each test tuple. With this information, we
rank and sort the tuples so that the tuple that is most likely to belong to the positive or
“yes” class appears at the top of the list, and the tuple that is least likely to belong to the
positive class lands at the bottom of the list. Naïve Bayesian (Section 8.3) and backpropagation
(Section 9.2) classifiers return a class probability distribution for each prediction
and, therefore, are appropriate, although other classifiers, such as decision tree classifiers
(Section 8.2), can easily be modified to return class probability predictions. Let the value
10 TPR and FPR are the two operating characteristics being compared.
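The ranking-and-thresholding procedure just described can be sketched as follows; the class labels and predicted probabilities are invented for illustration, and each point on the curve is obtained by lowering the decision threshold one tuple at a time.

```python
# Minimal sketch of ROC-curve construction from class-probability scores.
# Labels (1 = positive, 0 = negative) and scores are invented examples.
labels = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]
scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.55, 0.50, 0.40, 0.30]

P = sum(labels)          # number of positive tuples
N = len(labels) - P      # number of negative tuples

# Rank tuples so the one most likely to be positive comes first, then sweep
# the threshold downward, updating the TP and FP counts as each tuple is
# declared positive.
ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)

tp = fp = 0
roc_points = [(0.0, 0.0)]                    # the curve starts at the origin
for score, label in ranked:
    if label == 1:
        tp += 1
    else:
        fp += 1
    roc_points.append((fp / N, tp / P))      # (FPR, TPR) at this threshold

for fpr, tpr in roc_points:
    print(f"FPR = {fpr:.2f}  TPR = {tpr:.2f}")
```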
 