6.2.3.2 Skew-Specialized Active Learning
Additionally, there exists a body
of research literature on AL designed specifically to deal with the class imbalance problem.
Tomanek and Hahn [18] investigate query-by-committee-based approaches to
sampling labeled sentences for the task of named entity recognition. The goal of
their selection strategy is to encourage class-balanced selections by incorporating
class-specific costs. Unlabeled instances are ordered by a class-weighted,
entropy-based disagreement measure,

    −Σ_{j∈{0,1}} b_j (V(k_j)/|C|) log(V(k_j)/|C|),
where V(k_j) is the number of votes from a committee of size |C| that an
instance belongs to class k_j, and b_j is a weight corresponding to the importance
of including a certain class; a larger value of b_j corresponds to an increased
tendency to include examples that are thought to belong to that class. From
a window W of examples with the highest disagreement, instances are selected
greedily based on the model's estimated class membership probabilities so that
the batch selected from the window has the highest probability of having a
balanced class membership.
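The disagreement measure above can be sketched in a few lines of code. This is an illustrative implementation, not Tomanek and Hahn's own; the function name and argument layout are assumptions, but the computation follows the formula: each class's vote fraction contributes its weighted entropy term.

```python
import math

def weighted_vote_entropy(votes, weights, committee_size):
    """Class-weighted vote-entropy disagreement for one unlabeled instance.

    votes[j]   -- number of committee votes V(k_j) for class k_j
    weights[j] -- class-importance weight b_j (larger favors that class)
    committee_size -- |C|, the number of committee members
    """
    score = 0.0
    for j, v in enumerate(votes):
        p = v / committee_size          # V(k_j) / |C|
        if p > 0:                       # treat 0 * log 0 as 0
            score -= weights[j] * p * math.log(p)
    return score
```

With uniform weights, an evenly split committee (maximum disagreement) scores log 2, while a unanimous committee scores 0; raising b_j for the minority class boosts the rank of instances the committee suspects belong to it.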
SVM-based AL has been shown [19] to be a highly effective strategy for
addressing class imbalance without any skew-specific modifications to the algorithm.
Bloodgood and Shanker [20] extend the benefits of SVM-based AL by
proposing an approach that incorporates class-specific costs. That is, the typical
C factor describing an SVM's misclassification penalty is broken up into C+
and C−, describing the costs associated with misclassification of positive and negative
examples, respectively, a common approach for improving the performance of
SVMs in cost-sensitive settings. Additionally, cost-sensitive SVMs are known to
yield predictive advantages in imbalanced settings by offering some preference to
an otherwise overlooked class, often using the heuristic for setting class-specific
costs

    C+/C− = |{x | x ∈ −}| / |{x | x ∈ +}|,

a ratio in inverse proportion to the
number of examples in each class. However, in the AL setting, the true class
ratio is unknown, and the quantity C+/C− must be estimated by the AL system.
Bloodgood and Shanker show that it is advantageous to use a preliminary stage
of random selection in order to establish some estimate of the class ratio, and
then proceed with example selection according to the uncertainty-based "simple
margin" criterion using the appropriately tuned cost-sensitive SVM.
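The two-stage recipe can be illustrated with a small helper that turns the labels from the preliminary random-selection stage into class-specific penalties. The function name and the base_C parameter are hypothetical conveniences; the ratio itself is the inverse-class-ratio heuristic from the text.

```python
def estimate_costs(seed_labels, base_C=1.0):
    """Set class-specific SVM costs from a preliminary randomly
    labeled seed set, using the inverse-class-ratio heuristic
    C+/C- = |negatives| / |positives|.

    seed_labels -- labels (+1 / -1) gathered in the random stage
    Returns (C_plus, C_minus).
    """
    pos = sum(1 for y in seed_labels if y == 1)
    neg = len(seed_labels) - pos
    if pos == 0:
        raise ValueError("seed set contains no positive examples")
    return base_C * neg / pos, base_C
```

In practice these two values would be passed to a cost-sensitive SVM (for example, via a per-class weight option in an SVM library) before continuing with simple-margin selection; the rarer the positive class in the seed sample, the larger its misclassification penalty.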
AL has also been studied as a way to improve the generalization performance
of resampling strategies that address class imbalance. In these settings, AL is
used to choose a set of instances for labeling, with sampling strategies used to
improve the class distribution. Ertekin [21] presented the virtual instance resampling
technique using active learning (VIRTUAL), a hybrid method of oversampling
and AL that forms an adaptive technique for resampling of the minority class
instances. The learner selects the most informative example x_i for oversampling,
and the algorithm creates a synthetic instance along the direction from x_i to
one of its k nearest neighbors. The algorithm works in an online manner and builds the classifier
incrementally without the need to retrain on the entire labeled dataset after
creating a new synthetic example. This approach, which we present in detail in
Section 6.4, yields an efficient and scalable learning framework.
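The oversampling step described above, interpolating between an informative instance and one of its k neighbors, can be sketched as follows. This is a minimal SMOTE-style interpolation sketch, not Ertekin's full VIRTUAL algorithm (which also handles the online model update); the function name is an assumption.

```python
import random

def synthetic_instance(x_i, neighbors, rng=random):
    """Create one synthetic minority-class example along the direction
    from the informative instance x_i to a randomly chosen one of its
    k minority-class neighbors (SMOTE-style linear interpolation).

    x_i       -- feature vector chosen by the active learner
    neighbors -- list of x_i's k nearest minority-class neighbors
    """
    nb = rng.choice(neighbors)
    gap = rng.random()   # interpolation factor in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x_i, nb)]
```

Each call yields a point on the segment between x_i and the chosen neighbor, so the synthetic example stays inside the local region of the minority class rather than being a verbatim duplicate.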