among the earliest successful examples of active machine learning techniques [3,
4]. The intuition behind uncertainty-based selection is that the region surrounding a model's decision boundary is where that model is most likely to make mistakes. Incorporating labeled examples from this region may improve the model's performance along this boundary, leading to gains in overall accuracy.
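As a concrete illustration (not drawn from the cited references), a minimal least-confidence uncertainty-sampling step might look like the following sketch; it assumes a scikit-learn-style classifier exposing predict_proba, and the names model, X_pool, and batch_size are purely illustrative:

```python
import numpy as np

def uncertainty_select(model, X_pool, batch_size=10):
    """Select the pooled examples the current model is least certain about.

    Uncertainty is measured here as 1 minus the maximum predicted class
    probability, so examples lying near the decision boundary score highest.
    """
    probas = model.predict_proba(X_pool)          # shape: (n_pool, n_classes)
    uncertainty = 1.0 - probas.max(axis=1)        # least-confidence score
    return np.argsort(-uncertainty)[:batch_size]  # indices of most uncertain examples
```

The selected indices would then be passed to an oracle for labeling, and the model retrained on the enlarged training set before the next selection round.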
Many popular subsequent techniques are specializations of uncertainty selec-
tion, including query-by-committee-based approaches [5-7], where, given an
ensemble of (valid) predictive models, examples are selected based on the level
of disagreement elicited among the ensemble, and the popular “simple margin”
technique proposed by Tong and Koller [8], where, given a current parameter-
ization of a support vector machine (SVM), w_j, the example x_i is chosen that comes closest to the decision boundary, x_i = argmin_x |w_j · Φ(x)|, where Φ(·) is a function mapping an example to an alternate space utilized by the kernel function in the SVM: k(u, v) = Φ(u) · Φ(v).
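A minimal sketch of this simple-margin selection is given below, assuming a fitted scikit-learn SVC (or any binary classifier exposing decision_function) trained on the currently labeled data; the names svm and X_pool are illustrative:

```python
import numpy as np

def simple_margin_select(svm, X_pool):
    """Return the index of the unlabeled example closest to the SVM boundary.

    For a fitted binary SVC, |decision_function(x)| grows with the distance of
    x from the separating hyperplane in the kernel-induced feature space, so
    the smallest absolute value marks the example nearest the boundary.
    """
    margins = np.abs(svm.decision_function(X_pool))  # distance-like scores
    return int(np.argmin(margins))                    # closest-to-boundary example
```

An analogous score can be built for query-by-committee approaches by replacing the margin with a measure of disagreement (e.g., vote entropy) across the ensemble.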
Expected-utility-based approaches choose examples based on the estimated expected improvement in a certain objective that would be achieved by incorporating a given example into the training set. 2 Such techniques often involve costly nested cross-validation where each available example is assigned all possible label states [9-11].
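To make the cost of such schemes concrete, the following deliberately simplified sketch retrains the model once per candidate example and label state; it assumes a clone-able scikit-learn-style classifier and a user-supplied utility function (e.g., accuracy on a held-out set), and it omits the nested cross-validation that full implementations typically employ:

```python
import numpy as np
from sklearn.base import clone

def expected_utility_select(model, X_train, y_train, X_pool, utility):
    """Pick the pooled example whose labeling maximizes estimated expected utility.

    Each candidate example is assigned every possible label state in turn,
    the model is retrained, and the resulting utilities are averaged using
    the current model's label probabilities as weights.
    """
    label_probas = model.predict_proba(X_pool)   # P(y | x) under the current model
    best_idx, best_score = None, -np.inf
    for i, x in enumerate(X_pool):
        score = 0.0
        for j, y in enumerate(model.classes_):
            # Hypothetically add (x, y) to the training set and retrain.
            X_aug = np.vstack([X_train, x.reshape(1, -1)])
            y_aug = np.append(y_train, y)
            candidate = clone(model).fit(X_aug, y_aug)
            score += label_probas[i, j] * utility(candidate)
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx
```

As noted in footnote 2, the utility callback need not measure the base model's use-time objective; it could, for instance, estimate the reduction in problem uncertainty instead.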
6.2.2 Dealing with the Class Imbalance Problem in Active Learning
Selecting examples from an unlabeled pool with substantial class imbalance may
pose several difficulties for traditional AL. The greater proportion of examples
in the majority class may lead to a model that prefers one class over another.
If the labels of examples selected by an AL scheme are thought of as a random
variable, the innate class imbalance in the example pool would almost certainly
lead to a preference for majority examples in the training set. Unless properly
dealt with, 3 this over-representation may simply lead to a predictive preference
for the majority class when labeling. Typically, when making predictive models
in an imbalanced setting, it is the minority class that is of interest. For instance,
it is important to discover patients who have a rare but dangerous ailment based
on the results of a blood test, or infrequent but costly fraud in a credit card
company's transaction history. This difference in class preferences between an
end system's needs and a model's tendencies causes a serious problem for AL
(and predictive systems in general) in imbalanced settings. Even if the problem of a highly imbalanced (although correct in terms of base rate) training set can be dealt with, the tendency for a selection algorithm to gather majority
examples creates other problems. The nuances of the minority set may be poorly
represented in the training data, leading to a “predictive misunderstanding” in
2 Note that this selection objective may not necessarily be the same objective used during the base
model's use time. For instance, examples may be selected according to their contribution to the
reduction in problem uncertainty.
3 For instance, by imbalanced learning techniques described throughout this topic.