task is to differentiate sports web pages from nonsports pages. Depending on
the source of the data (e.g., different impression streams from different online
advertisers), one could see very different degrees of class skew in the population
of relevant web pages. The panels in Figure 6.10, left-to-right, depict increasing
amounts of induced class skew. On the far left, we see that for a balanced
class distribution, uncertainty sampling is indeed better than RS. For a 10:1
distribution, uncertainty sampling has some problems very early on, but soon
does better than RS, even more so than in the balanced case. However, as
the skew begins to get large, not only does RS start to fail (it finds fewer and
fewer minority instances, and its learning suffers), but uncertainty sampling also
does substantially worse than random for a considerable amount of labeling expenditure.
In the most extreme case shown (see footnote 6), both RS and uncertainty sampling
simply fail completely. RS effectively does not select any positive examples, and
neither does uncertainty sampling (see footnote 7).
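The effect of skew on the two selection strategies can be illustrated with a minimal sketch. It is not the experimental code behind Figure 6.10. The function names, the labeling budget of 1,000, and the intermediate skew values are assumptions chosen for illustration; only the balanced, 10:1, and 10,000:1 settings are named in the text. The first function gives the expected number of minority instances that RS labels, and the second shows the usual binary form of uncertainty sampling, which picks the instances whose estimated positive-class probability is closest to 0.5.

```python
def expected_minority_in_random_sample(n_labeled, skew):
    """Expected minority-class count when n_labeled instances are drawn
    uniformly at random from a pool with a skew:1 majority-to-minority ratio."""
    return n_labeled / (skew + 1.0)


def uncertainty_sample(scores, k):
    """Return the indices of the k instances whose estimated positive-class
    probability is closest to 0.5 (binary uncertainty sampling)."""
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i] - 0.5))
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical budget of 1,000 labels at illustrative skew levels.
    for skew in (1, 10, 100, 1000, 10000):
        expected = expected_minority_in_random_sample(1000, skew)
        print(f"{skew}:1 skew -> about {expected:.2f} minority instances")

    # Hypothetical model scores for five unlabeled instances.
    print(uncertainty_sample([0.9, 0.45, 0.1, 0.55, 0.02], k=2))
```

At 10,000:1, a budget of 1,000 random labels is expected to contain fewer than one minority instance, which is consistent with RS effectively selecting no positive examples. Uncertainty sampling depends on the current model's scores, so a model trained with almost no positives gives it little to work with either.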
A practitioner well versed in the AL literature may decide he/she should use
a method other than uncertainty sampling in such a highly skewed domain. A
variety of techniques have been discussed in Sections 6.2-6.4 for performing
AL specifically under class imbalance, including [18-21, 35], as well as for
performing density-sensitive AL, where the geometry of the problem space is
specifically included when making selections, including [13-15, 17, 36]. While
initially appealing, as problems become increasingly difficult, these techniques
may not provide results better than more traditional AL techniques—indeed class
skews may be sufficiently high to thwart these techniques completely [33].
As discussed later in Section 6.8.1, Attenberg and Provost [33] proposed
an alternative way of using human resources to produce a labeled training
set, specifically tasking people with finding class-specific instances (“guided
learning”) as opposed to labeling specific instances. In some domains, finding
such instances may even be cheaper than labeling (per instance). Guided learning
can be much more effective per instance acquired; in one of Attenberg and
Provost's experiments, it outperformed AL as long as searching for class-specific
instances was less than eight times more expensive (per instance) than labeling
selected instances. The generalization performance of guided learning is shown
in Figure 6.12, discussed in Section 6.8.1 for the same setting as Figure 6.10.
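The cost trade-off described above can be written down as a simple break-even check. This is a sketch under stated assumptions: the function names are hypothetical, the default factor of eight comes from the single experiment mentioned in the text and should not be read as a general constant, and "less than eight times more expensive" is interpreted here as a search-to-label cost ratio below 8.

```python
def instances_for_budget(budget, unit_cost):
    """Number of whole instances that can be acquired for a given budget."""
    if unit_cost <= 0:
        raise ValueError("unit_cost must be positive")
    return int(budget // unit_cost)


def guided_learning_preferred(search_cost, label_cost, breakeven_ratio=8.0):
    """True when the per-instance search cost is below breakeven_ratio times
    the per-instance labeling cost.

    breakeven_ratio=8.0 is an assumption taken from one reported experiment;
    it will differ across domains and data sets.
    """
    if label_cost <= 0:
        raise ValueError("label_cost must be positive")
    return search_cost / label_cost < breakeven_ratio


if __name__ == "__main__":
    # Hypothetical costs in arbitrary units.
    print(guided_learning_preferred(search_cost=5.0, label_cost=1.0))
    print(guided_learning_preferred(search_cost=10.0, label_cost=1.0))
    print(instances_for_budget(budget=100.0, unit_cost=5.0))
```

In practice the break-even ratio would have to be estimated for the domain at hand, for example by comparing learning curves for the two acquisition strategies at equal total cost.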
6.6 DEALING WITH DISJUNCTIVE CLASSES
Even more subtly still, certain problem spaces may not have such an extreme
class skew, but may still be particularly difficult because they possess important
but very small disjunctive subconcepts, rather than simple continuously dense
6 10,000:1, which is still orders of magnitude less skewed than some categories.
7 The curious behavior of AUC < 0.5 here is due to overfitting. Regularizing the logistic regression
“fixes” the problem, and the curve hovers about 0.5. See another article in this issue for more insight
on models exhibiting AUC < 0.5 [34].