[Figure: five panels at class ratios 1:1, 10:1, 100:1, 1000:1, and 10000:1; x-axis: labeled examples (0 to 10,000); y-axis range: 0.4 to 1; curves: Random, Uncertainty, Guided]
Figure 6.12 Comparison of random sampling, uncertainty sampling, and guided
learning on the problem shown in Figure 6.10.
data selection directly depends on the understanding of the space provided by the
“current” model, early stages of acquisitions can result in a vicious cycle of
uninformative selections, leading to poor quality models and therefore to additional
poor selections.
The difficulties posed by the cold start problem can be particularly acute
in highly skewed or disjunctive problem spaces: informative instances may be
difficult for AL to find because of their variety or rarity, potentially leading
to substantial waste in data selection. Difficulties early in the AL process can,
at least in part, be attributed to the base classifier's poor understanding of the
problem space, which makes the cold start problem worse in domains that are
already difficult. Since the value of subsequent label selections depends on the
base learner's understanding of the problem space, poor selections in the early
phases of AL propagate their harm across the learning curve.
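The dependence of selection on the current model can be illustrated with a minimal sketch of uncertainty sampling. The function name `uncertainty_sample` and the probability scores below are hypothetical, chosen only for illustration; the sketch assumes a binary classifier that outputs a positive-class probability for each pool instance.

```python
def uncertainty_sample(pool_probs, batch_size):
    """Return indices of the pool instances whose predicted positive-class
    probability is closest to 0.5, i.e. the most uncertain under the
    current model."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: abs(pool_probs[i] - 0.5))
    return ranked[:batch_size]

# Hypothetical scores from a poorly trained early model on a skewed pool.
# Assume index 4 is a rare positive that the model confidently (and wrongly)
# scores as negative.
probs = [0.48, 0.10, 0.55, 0.30, 0.02, 0.51]
picked = uncertainty_sample(probs, batch_size=3)
print(picked)
```

Because the rare positive at index 4 receives a confident score, it ranks last for selection. A model that misunderstands the space in this way will keep requesting labels elsewhere, which is one way the vicious cycle described above can arise.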
In many research papers, AL experiments are “primed” with a preselected,
often class-balanced training set. As pointed out by Attenberg and Provost [33],
if a procedure exists for procuring a class-balanced training set to start the
process, the most cost-effective model-development alternative may be not to do
AL at all, but simply to continue using that procedure. This is exemplified
in Figure 6.12 [33], where the dot-and-hatched lines show the effect of investing
resources to continue to procure a class-balanced, but otherwise random, training
set (as compared with the active acquisition shown in Figure 6.10).
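A rough sketch of why a class-balanced procurement procedure can matter under skew: the simulation below compares the class composition of a purely random draw with that of a balanced draw from the same pool. It is not the experiment from [33]. The helper names, the 100:1 pool, and the budget are assumptions made for illustration, and the balanced draw assumes a hypothetical procedure that can supply examples by class.

```python
import random

def make_pool(n_neg, n_pos, seed=0):
    """Build a shuffled pool of labels (0 = majority, 1 = minority)."""
    rng = random.Random(seed)
    pool = [0] * n_neg + [1] * n_pos
    rng.shuffle(pool)
    return pool

def random_acquisition(pool, budget, rng):
    """Label `budget` examples drawn uniformly at random from the pool."""
    return rng.sample(pool, budget)

def balanced_acquisition(pool, budget, rng):
    """Acquire `budget` examples, half from each class. Assumes a
    hypothetical procedure that can procure examples of a given class."""
    pos = [y for y in pool if y == 1]
    neg = [y for y in pool if y == 0]
    half = budget // 2
    return rng.sample(pos, half) + rng.sample(neg, budget - half)

rng = random.Random(42)
pool = make_pool(n_neg=10_000, n_pos=100)   # 100:1 class skew
budget = 100
random_set = random_acquisition(pool, budget, rng)
balanced_set = balanced_acquisition(pool, budget, rng)
print("minority examples, random draw:  ", sum(random_set))
print("minority examples, balanced draw:", sum(balanced_set))
```

At this skew, a random draw of 100 labels is expected to contain about one minority example, whereas the balanced draw contains 50 by construction. Comparing model quality, as in Figure 6.12, would additionally require features and a learner, which this sketch omits.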
6.8 ALTERNATIVES TO ACTIVE LEARNING FOR IMBALANCED
PROBLEMS
In addition to traditional label acquisition for unlabeled examples, there are other
sorts of data that may be acquired at a cost for the purpose of building or
improving statistical models. The intent of this section is to provide the reader with a
brief overview of some alternative techniques for active data acquisition for
predictive model construction in a cost-restrictive setting. We begin this section with
a discussion of class-conditional example acquisition, a paradigm related to AL