positions changing. Minority instances drop down to the very bottom (certain
minority) either because they get chosen for labeling, or because labeling some
other instance caused the model to “realize” that they are minority instances.
We see that, early on, the minority instances are mixed all throughout the range
of estimated probabilities, even as the generalization accuracy increases. Then
the model becomes good enough that, abruptly, few minority class instances are
misclassified (above P = 0.5). This is the point where the learning curve levels
off for the first time. However, notice that there are still some residual misclassified minority instances, and in particular that there is a cluster of them for which
the model is relatively certain they are majority instances. It takes many epochs
for the AL to select one of these, at which point the generalization performance
increases markedly—apparently, this was a subconcept that was strongly misclassified by the model, and so it was not a high priority for exploration by
the AL.
On the 20 newsgroups dataset, we can examine the minority instances for
which P decreases the most in that late rise in the AUC curve (roughly, they
switch from being misclassified on the lower plateau to being correctly classified
afterward). Recall that the minority (positive) class here is “Science” newsgroups.
It turns out that these late-switching instances are members of the cryptography
(sci.crypt) subcategory. These pages were classified as non-Science presumably
because before having seen any positive instances of the subcategory, they looked much like the many computer-oriented subcategories in the (much
more prevalent) non-Science category. As soon as a few were labeled as Science,
the model generalized its notion of Science to include this subcategory (apparently
pretty well).
Density-sensitive AL techniques did not improve on uncertainty sampling
for this particular domain. This was surprising, given the support we have just
provided for our intuition that the concepts are disjunctive. One would expect
a density-oriented technique to be appropriate for this domain. Unfortunately,
in this domain—and we conjecture that this is typical of many domains with
extreme class imbalance—the majority class is even more disjunctive than the
minority class. For example, in 20 newsgroups, Science indeed has four very
different subclasses. However, non-Science has 16 (with much more variety).
Techniques that, for example, try to find as-of-yet unexplored clusters in the
instance space are likely to select from the vast and varied majority class. We
need more research on dealing with highly disjunctive classes, especially when
the less interesting class⁸ is more varied than the main class of interest.
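The selection rule underlying the uncertainty sampling discussed above can be made concrete with a minimal sketch: at each round, the learner queries the unlabeled pool instance whose estimated probability of the minority class is closest to the P = 0.5 decision boundary. The synthetic two-feature data, the scikit-learn logistic regression model, and the seed-set construction below are illustrative assumptions, not the chapter's actual experimental setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical imbalanced pool (~5% minority), standing in for a
# "Science vs. non-Science"-style task from the text.
n = 1000
X = rng.normal(size=(n, 2))
y = (X[:, 0] + X[:, 1] > 2.3).astype(int)  # rare positive class

# Seed the labeled set with one minority and nine majority instances
# so the initial model sees both classes.
pos = np.flatnonzero(y == 1)
neg = np.flatnonzero(y == 0)
labeled = [int(pos[0])] + [int(i) for i in neg[:9]]
pool = [i for i in range(n) if i not in labeled]

model = LogisticRegression()
for _ in range(20):  # 20 AL rounds, one query per round
    model.fit(X[labeled], y[labeled])
    p = model.predict_proba(X[pool])[:, 1]
    # Uncertainty sampling: query the instance whose estimated
    # P(minority) is closest to the 0.5 decision boundary.
    q = pool[int(np.argmin(np.abs(p - 0.5)))]
    labeled.append(q)
    pool.remove(q)
```

As the text notes, such a learner can take many rounds to reach a minority subconcept that the model confidently (but wrongly) scores as majority, since those instances sit far from the 0.5 boundary; a density-sensitive variant would instead bias queries toward unexplored regions of the instance space.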
6.7 STARTING COLD
The cold start problem has long been known to be a key difficulty in building effective classifiers quickly and cheaply via AL [13, 16]. Since the quality of
⁸ How interesting a class is could be measured, for example, by its relative misclassification cost.