7.3 Diagnostics of Classifiers
So far, this topic has covered three classifiers: logistic regression, decision trees,
and naïve Bayes. These methods classify instances into distinct groups according
to the characteristics they share, and each faces the same question: how to evaluate
whether it performs well.
Several tools have been designed to evaluate the performance of a classifier. Such
tools are not limited to the three classifiers in this topic but serve the purpose
of assessing classifiers in general.
A confusion matrix is a specific table layout that allows visualization of the
performance of a classifier.
Table 7.6 shows the confusion matrix for a two-class classifier. True positives
(TP) are the number of positive instances the classifier correctly identified as
positive. False positives (FP) are the number of instances the classifier
identified as positive but that are in reality negative. True negatives (TN) are the
number of negative instances the classifier correctly identified as negative. False
negatives (FN) are the number of instances classified as negative but that are in
reality positive. In a two-class classification, a preset threshold may be used to
separate positives from negatives. TP and TN are the correct guesses. A good
classifier should have large TP and TN counts and small (ideally zero) FP and FN counts.
Table 7.6 Confusion Matrix

                              Predicted Class
                              Positive               Negative
  Actual Class   Positive     True Positives (TP)    False Negatives (FN)
                 Negative     False Positives (FP)   True Negatives (TN)
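The four cells of Table 7.6 can be tallied directly from paired lists of actual and predicted labels. The sketch below assumes illustrative label lists and a positive label of "yes"; none of these names come from the text.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count TP, FN, FP, TN for a two-class classifier."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                tp += 1  # positive correctly identified as positive
            else:
                fn += 1  # positive missed by the classifier
        else:
            if p == positive:
                fp += 1  # negative wrongly called positive
            else:
                tn += 1  # negative correctly identified as negative
    return tp, fn, fp, tn

# Hypothetical labels for six instances
actual    = ["yes", "yes", "no", "no",  "no", "yes"]
predicted = ["yes", "no",  "no", "yes", "no", "yes"]
tp, fn, fp, tn = confusion_counts(actual, predicted)  # (2, 1, 1, 2)
```

A good classifier concentrates its counts in `tp` and `tn`, leaving `fp` and `fn` near zero.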
In the bank marketing example, the training set includes 2,000 instances. An
additional 100 instances are included as the testing set. Table 7.7 shows the
confusion matrix of a naïve Bayes classifier on 100 clients, predicting whether they
would subscribe to the term deposit. Of the 11 clients who subscribed to the term
deposit, the model predicted 3 subscribed and 8 not subscribed. Similarly, of the
89 clients who did not subscribe to the term deposit, the model predicted 2 subscribed
and 87 not subscribed. All correct guesses lie on the diagonal from top left to bottom
right of the table. This makes it easy to visually inspect the table for errors:
they appear as any nonzero values off the diagonal.
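As a quick numerical check on these counts, the diagonal and off-diagonal cells of Table 7.7 can be tallied to get the overall fractions of correct and incorrect guesses. The accuracy computation here is a simple sketch; only the four counts come from the text.

```python
# Counts from Table 7.7 (bank marketing test set of 100 clients)
tp, fn = 3, 8    # 11 actual subscribers: 3 predicted correctly, 8 missed
fp, tn = 2, 87   # 89 non-subscribers: 2 wrongly predicted, 87 correct

total = tp + fn + fp + tn         # 100 test clients
correct = tp + tn                 # diagonal entries: correct guesses
errors = fp + fn                  # off-diagonal entries: errors
accuracy = correct / total        # 90 / 100 = 0.9
```

Note that despite 90% accuracy, the classifier catches only 3 of the 11 actual subscribers, which is why the matrix is inspected cell by cell rather than summarized by a single number.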