Java Data Mining Concepts - Java Data Mining: Strategy, Standard, and Practice


The structure of this table looks similar to the cost matrix illustrated in Figure 7-2, but the confusion matrix cells hold the model's counts of correct and incorrect predictions. If we consider Attriter as the positive target value, the false-positive (FP) prediction count is 60, and the false-negative (FN) prediction count is 30.

Although the confusion matrix measures misclassification of target values, in our example false negatives are three times costlier than false positives. To assess model quality from a business perspective, we need to measure cost in addition to accuracy. The total cost of false predictions is 3 × 30 + 1 × 60 = 150. If a different model produces 40 false positives and 40 false negatives, its overall accuracy is better, but its total cost is higher at 3 × 40 + 1 × 40 = 160. If a cost matrix is specified, it is important to take the cost values into account when measuring performance and to select the model with the lowest total cost.
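The cost comparison can be sketched in a few lines of Java. This is only an illustration: the class and method names are hypothetical and are not part of the JDM API, and the unit costs (1 per false positive, 3 per false negative) are assumed from the statement that false negatives are three times costlier.

```java
// Illustrative sketch only; MisclassificationCost is a hypothetical helper,
// not a class from the JDM API.
final class MisclassificationCost {

    /** Total cost of wrong predictions: fpCost * FP + fnCost * FN. */
    static int totalCost(int falsePositives, int falseNegatives, int fpCost, int fnCost) {
        return fpCost * falsePositives + fnCost * falseNegatives;
    }

    public static void main(String[] args) {
        int fpCost = 1; // assumed cost of predicting Attriter for an actual Non-attriter
        int fnCost = 3; // assumed cost of missing an actual Attriter

        int firstModel = totalCost(60, 30, fpCost, fnCost);  // 90 errors in total
        int secondModel = totalCost(40, 40, fpCost, fnCost); // 80 errors in total

        // The second model makes fewer errors but is the more expensive one.
        System.out.println("First model cost:  " + firstModel);
        System.out.println("Second model cost: " + secondModel);
    }
}
```

Keeping the costs as parameters makes it easy to rerun the comparison if the business assigns different penalties to the two error types.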

Receiver operating characteristics (ROC) is another way to compare classification model quality. An ROC graph places the false positive rate on the X-axis and the true positive rate on the Y-axis, as shown in Figure 7-7. Here, the false positive rate is the ratio of the number of false positives to the total number of actual negatives. Similarly, the true positive rate is the ratio of the number of true positives to the total number of actual positives.
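The two rate definitions translate directly into code. The sketch below assumes the four confusion matrix counts are already available; the class name and the sample counts in the usage note are hypothetical and are not taken from the text.

```java
// Illustrative sketch only; RocRates is a hypothetical helper, not part of the JDM API.
final class RocRates {

    /** False positive rate = FP / (FP + TN), i.e. FP over all actual negatives. */
    static double falsePositiveRate(int falsePositives, int trueNegatives) {
        int actualNegatives = falsePositives + trueNegatives;
        if (actualNegatives == 0) {
            throw new IllegalArgumentException("no actual negatives in the test data");
        }
        return (double) falsePositives / actualNegatives;
    }

    /** True positive rate = TP / (TP + FN), i.e. TP over all actual positives. */
    static double truePositiveRate(int truePositives, int falseNegatives) {
        int actualPositives = truePositives + falseNegatives;
        if (actualPositives == 0) {
            throw new IllegalArgumentException("no actual positives in the test data");
        }
        return (double) truePositives / actualPositives;
    }
}
```

For example, with 60 false positives among 600 actual negatives (an assumed total), `RocRates.falsePositiveRate(60, 540)` gives a false positive rate of 0.1.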

To plot the ROC graph, the test task determines the false positive and true positive rates at different probability thresholds. Here, the probability threshold is the level above which the predicted probability of the positive target value is treated as a positive prediction. Different probability threshold values result in different false positive rates and true positive rates. For example, when the Attriter prediction probability is 0.4 and the probability threshold is set to 0.3, the customer is predicted as an Attriter, whereas if the probability threshold is 0.5, the customer is predicted as a Non-attriter, as illustrated in Figure 7-7(a).
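A minimal sketch of how one point of the ROC graph could be derived for a given threshold is shown below. It assumes the test data is available as an array of predicted Attriter probabilities plus the actual outcomes; the class, method names, and sample data are hypothetical, and this is not how any particular JDM implementation is known to compute the curve.

```java
// Illustrative sketch only; ThresholdSweep is a hypothetical helper, not part of the JDM API.
final class ThresholdSweep {

    /** A case is predicted positive when its probability lies above the threshold. */
    static boolean isPositive(double positiveProbability, double threshold) {
        return positiveProbability > threshold;
    }

    /**
     * Returns {falsePositiveRate, truePositiveRate} at the given threshold.
     * Assumes the data contains at least one actual positive and one actual negative.
     */
    static double[] rocPoint(double[] positiveProbabilities, boolean[] actualPositive,
                             double threshold) {
        int tp = 0, fp = 0, fn = 0, tn = 0;
        for (int i = 0; i < positiveProbabilities.length; i++) {
            boolean predicted = isPositive(positiveProbabilities[i], threshold);
            if (predicted && actualPositive[i]) {
                tp++;
            } else if (predicted) {
                fp++;
            } else if (actualPositive[i]) {
                fn++;
            } else {
                tn++;
            }
        }
        return new double[] { (double) fp / (fp + tn), (double) tp / (tp + fn) };
    }
}
```

Calling `rocPoint` once per threshold, for example 0.3 and 0.5, yields the series of points that make up the curve; `isPositive(0.4, 0.3)` is true while `isPositive(0.4, 0.5)` is false, matching the Attriter example.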

Figure 7-7(b) illustrates the ROC curves of two classification models, plotted at different probability thresholds. Each model performs better in a different range of false positive rates; for example, at a false positive rate of 0.1, Model B has a higher true positive rate than Model A. However, at false positive rates of 0.3 and above, Model A outperforms Model B. Based on the acceptable false positive rate, users can select the model and its probability threshold. The area under the ROC curve is another measure of the overall performance of a classification model. Generally, the larger the area under the ROC curve, the better the model performance.
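One common way to estimate the area under the ROC curve from a finite set of points is the trapezoidal rule. The text does not say how the area is computed, so the sketch below is an assumption. The class name is hypothetical, and the points are expected to be sorted by ascending false positive rate and to include the end points (0, 0) and (1, 1).

```java
// Illustrative sketch only; RocArea is a hypothetical helper, not part of the JDM API.
final class RocArea {

    /**
     * Approximates the area under the ROC curve with the trapezoidal rule.
     * fpr and tpr hold the curve's points, sorted by ascending false positive rate.
     */
    static double areaUnderCurve(double[] fpr, double[] tpr) {
        if (fpr.length != tpr.length) {
            throw new IllegalArgumentException("fpr and tpr must have the same length");
        }
        double area = 0.0;
        for (int i = 1; i < fpr.length; i++) {
            double width = fpr[i] - fpr[i - 1];
            double meanHeight = (tpr[i] + tpr[i - 1]) / 2.0;
            area += width * meanHeight; // area of one trapezoid between adjacent points
        }
        return area;
    }
}
```

A model no better than random guessing follows the diagonal from (0, 0) to (1, 1) and gets an area of 0.5, while a curve that rises quickly toward the top-left corner approaches 1.0.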

