positions changing. Minority instances drop down to the very bottom (certain
minority) either because they get chosen for labeling, or because labeling some
other instance caused the model to “realize” that they are minority instances.
We see that, early on, the minority instances are mixed all throughout the range
of estimated probabilities, even as the generalization accuracy increases. Then
the model becomes good enough that, abruptly, few minority class instances are
misclassified (above P = 0.5). This is the point where the learning curve levels
off for the first time. However, notice that there are still some residual misclassified minority instances, and in particular that there is a cluster of them for which
the model is relatively certain they are majority instances. It takes many epochs
for the AL to select one of these, at which point the generalization performance
increases markedly—apparently, this was a subconcept that was strongly misclassified by the model, and so it was not a high priority for exploration by
the AL.
On the 20 newsgroups dataset, we can examine the minority instances for
which P decreases the most in that late rise in the AUC curve (roughly, they
switch from being misclassified on the lower plateau to being correctly classified
afterward). Recall that the minority (positive) class here is “Science” newsgroups.
It turns out that these late-switching instances are members of the cryptography
(sci.crypt) subcategory. These pages were classified as non-Science presumably
because before having seen any positive instances of the subcategory, they looked much like the many computer-oriented subcategories in the (much
more prevalent) non-Science category. As soon as a few were labeled as Science,
the model generalized its notion of Science to include this subcategory (apparently
pretty well).
Density-sensitive AL techniques did not improve on uncertainty sampling
for this particular domain. This was surprising, given the support we have just
provided for our intuition that the concepts are disjunctive. One would expect
a density-oriented technique to be appropriate for this domain. Unfortunately,
in this domain—and we conjecture that this is typical of many domains with
extreme class imbalance—the majority class is even more disjunctive than the
minority class. For example, in 20 newsgroups, Science indeed has four very
different subclasses. However, non-Science has 16 (with much more variety).
Techniques that, for example, try to find as-of-yet unexplored clusters in the
instance space are likely to select from the vast and varied majority class. We
need more research on dealing with highly disjunctive classes, especially when
the less interesting class⁸ is more varied than the main class of interest.
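The selection rule underlying the uncertainty sampling discussed above can be made concrete with a minimal sketch: at each round, the learner queries the unlabeled pool instance whose estimated probability of the minority class is closest to the P = 0.5 decision boundary. The synthetic two-feature data, the scikit-learn logistic regression model, and the seed-set construction below are illustrative assumptions, not the chapter's actual experimental setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical imbalanced pool (~5% minority), standing in for a
# "Science vs. non-Science"-style task from the text.
n = 1000
X = rng.normal(size=(n, 2))
y = (X[:, 0] + X[:, 1] > 2.3).astype(int)  # rare positive class

# Seed the labeled set with one minority and nine majority instances
# so the initial model sees both classes.
pos = np.flatnonzero(y == 1)
neg = np.flatnonzero(y == 0)
labeled = [int(pos[0])] + [int(i) for i in neg[:9]]
pool = [i for i in range(n) if i not in labeled]

model = LogisticRegression()
for _ in range(20):  # 20 AL rounds, one query per round
    model.fit(X[labeled], y[labeled])
    p = model.predict_proba(X[pool])[:, 1]
    # Uncertainty sampling: query the instance whose estimated
    # P(minority) is closest to the 0.5 decision boundary.
    q = pool[int(np.argmin(np.abs(p - 0.5)))]
    labeled.append(q)
    pool.remove(q)
```

As the text notes, such a learner can take many rounds to reach a minority subconcept that the model confidently (but wrongly) scores as majority, since those instances sit far from the 0.5 boundary; a density-sensitive variant would instead bias queries toward unexplored regions of the instance space.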
6.7 STARTING COLD
The cold start problem has long been known to be a key difficulty in building effective classifiers quickly and cheaply via AL [13, 16]. Since the quality of
⁸ How interesting a class is could be measured, for example, by its relative misclassification cost.