example with the lowest score is assigned the rank 1, so that ranks increase with
the classification score. Then, we can calculate the AUC as:
\[
\mathrm{AUC}(f) \;=\; \frac{\sum_{i=1}^{|T_p|} \left( R_i - i \right)}{|T_p|\,|T_n|}
\]
where $T_p \subseteq T$ and $T_n \subseteq T$ are, respectively, the subsets of positive and negative
examples in the test set $T$, and $R_i$ is the rank of the $i$th example in $T_p$ given by
classifier $f$.
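As a concrete illustration, this rank-based formula can be computed directly from a set of classification scores. The short Python sketch below (the scores and labels are made up for this example) assigns ranks in increasing order of score, applies the formula, and cross-checks the result against an explicit count of correctly ordered positive-negative pairs.

```python
# Minimal sketch of the rank-based AUC formula above.
# The scores and labels are invented for illustration only.

def auc_from_ranks(scores, labels):
    """AUC(f) = sum_i (R_i - i) / (|Tp| * |Tn|), with rank 1 given to the
    lowest score (ranks increase with the classification score)."""
    order = sorted(range(len(scores)), key=lambda j: scores[j])
    rank = {j: r + 1 for r, j in enumerate(order)}          # 1 = lowest score
    pos_ranks = sorted(rank[j] for j in range(len(labels)) if labels[j] == 1)
    n_pos = len(pos_ranks)
    n_neg = len(labels) - n_pos
    return sum(r - (i + 1) for i, r in enumerate(pos_ranks)) / (n_pos * n_neg)


def auc_pairwise(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive example
    receives the higher score (ties counted as half): the probabilistic
    reading of the AUC discussed in the text."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    labels = [1, 0, 1, 1, 0, 0, 1, 0]                       # hypothetical test labels
    scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]       # hypothetical classifier scores
    print(auc_from_ranks(scores, labels))                   # 0.6875
    print(auc_pairwise(scores, labels))                     # 0.6875 (same value)
```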
The AUC essentially measures the probability that the classifier assigns a higher
rank to a randomly chosen positive example than to a randomly chosen negative
example. Although the AUC is intended as a summary statistic, like other
single-metric performance measures it loses significant information about the
behavior of the learning algorithm over the entire operating range (for instance,
it misses information on concavities in the performance, or on trade-off behaviors
between the TP and FP rates).
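To make this loss of information concrete, the following sketch (the two rankings are constructed by hand for this illustration) builds the ROC points of two hypothetical classifiers over the same eight test examples. Both obtain exactly the same AUC of 0.75, yet one dominates at low false-positive rates and the other at high ones, so their ROC curves cross and the single summary number hides the trade-off.

```python
# Two hypothetical classifiers ranking the same 8 test examples differently.
# 1 = positive, 0 = negative, listed from highest score to lowest.
ranking_a = [1, 1, 0, 0, 1, 1, 0, 0]   # strong at the very top of the ranking
ranking_b = [0, 1, 1, 1, 1, 0, 0, 0]   # weak at the very top, strong just below

def roc_points(ranking):
    """ROC curve as (FPR, TPR) points obtained by sweeping the decision
    threshold down the ranking (highest score first)."""
    n_pos, n_neg = sum(ranking), len(ranking) - sum(ranking)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for y in ranking:
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve via the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

a, b = roc_points(ranking_a), roc_points(ranking_b)
print(auc(a), auc(b))   # both 0.75, yet the curves cross:
print(a)                # reaches TPR = 0.5 already at FPR = 0
print(b)                # reaches TPR = 1.0 already at FPR = 0.25
```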
It can be argued that the AUC is a good way to get a score for the general
performance of a classifier and to compare it to that of another classifier. This
is particularly true in the case of imbalanced data where, as discussed earlier,
accuracy is too strongly biased toward the dominant class. However, some criticisms
have also appeared warning against the use of AUC across classifiers for
comparative purposes. One of the most obvious ones is that if the ROC curves
of the two classifiers intersect (such as in the case of Figure 8.2), then the
AUC-based comparison between the classifiers can be relatively uninformative and
even misleading. However, a possibly more serious limitation of the AUC for
comparative purposes lies in the fact that the misclassification cost distributions
(and hence the skew-ratio distributions) used by the AUC are different for
different classifiers. This is discussed in the next subsection, which generally looks
at newer and more experimental ranking metrics and graphical methods.
8.4.5 Newer Ranking Metrics and Graphical Methods
8.4.5.1 The H-Measure The more serious criticism of the AUC just mentioned
means that, when comparing different classifiers using the AUC, one may in
fact be comparing apples and oranges, since the AUC may give more weight to
a point misclassified by classifier A than to the same point misclassified by
classifier B. This is because the AUC uses an implicit weight function that
varies from classifier to classifier. This criticism was made by Hand [16], who
also proposed the H-measure to remedy the problem. The H-measure lets the user
select a cost-weight function that is the same for all the classifiers under
comparison and thus allows for fairer comparisons. The formulation of the
H-measure is somewhat involved and will not be discussed here. The reader is
referred to [16] for further details about the H-measure, as well as for a
pointer to R code implementing it.
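To make the common-weight idea concrete without reproducing Hand's formulation, the sketch below illustrates the principle only: each classifier is evaluated by its expected minimum misclassification cost, where the cost ratio c is drawn from one distribution that the user fixes once and shares across all classifiers under comparison. The Beta(2, 2) weight, the threshold grid, and the scores are illustrative assumptions; this is not the H-measure as defined in [16].

```python
# Simplified sketch of the shared-cost-weight principle behind the H-measure:
# every classifier is scored against the SAME user-chosen distribution over the
# misclassification-cost ratio c, rather than the classifier-dependent implicit
# weights of the AUC.  Illustration of the idea only, not Hand's formulation
# (see [16] for that and for the reference R code).

def expected_min_cost(scores, labels, cost_weights):
    """Weighted average, over (c, w) pairs, of the minimum achievable loss
    c*pi_pos*FNR(t) + (1-c)*pi_neg*FPR(t) when the threshold t is chosen
    optimally for each cost ratio c."""
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    pi_pos, pi_neg = n_pos / n, n_neg / n

    # Candidate thresholds: every distinct score, plus one below the minimum
    # so that "predict everything positive" is also considered.
    thresholds = sorted(set(scores))
    thresholds = [thresholds[0] - 1.0] + thresholds

    def rates(t):
        # Predict positive when score > t.
        fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s <= t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s > t)
        return fn / n_pos, fp / n_neg

    total_w = sum(w for _, w in cost_weights)
    loss = 0.0
    for c, w in cost_weights:
        best = min(c * pi_pos * fnr + (1 - c) * pi_neg * fpr
                   for fnr, fpr in (rates(t) for t in thresholds))
        loss += w * best
    return loss / total_w

# A shared, user-chosen weight over cost ratios c in (0, 1): here a Beta(2, 2)
# density evaluated on a grid (an assumption made for this illustration).
grid = [(i + 0.5) / 50 for i in range(50)]
weights = [(c, 6 * c * (1 - c)) for c in grid]

labels   = [1, 0, 1, 1, 0, 0, 1, 0]                      # hypothetical test set
scores_a = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]      # classifier A (made up)
scores_b = [0.6, 0.1, 0.9, 0.8, 0.3, 0.2, 0.7, 0.4]      # classifier B (made up)

# Lower expected minimum cost is better; both classifiers face the same weights.
print(expected_min_cost(scores_a, labels, weights))
print(expected_min_cost(scores_b, labels, weights))
```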
It is worth noting, however, that Hand's [16] criticism was recently challenged
by Flach et al. [17], who found that the criticism may only hold when the AUC is