6.2.3.2 Skew-Specialized Active Learning

Additionally, there exists a body
of research literature on AL designed specifically to deal with the class imbalance
problem. Tomanek and Hahn [18] investigate query-by-committee-based approaches to
sampling labeled sentences for the task of named entity recognition. The goal of
their selection strategy is to encourage class-balanced selections by incorporating
class-specific costs. Unlabeled instances are ordered by a class-weighted,
entropy-based disagreement measure, -Σ_{j∈{0,1}} b_j (V(k_j)/|C|) log(V(k_j)/|C|),
where V(k_j) is the number of votes from a committee of size |C| that an
instance belongs to class k_j, and b_j is a weight corresponding to the importance
of including a certain class; a larger value of b_j corresponds to an increased
tendency to include examples that are thought to belong to this class. From
a window W of examples with highest disagreement, instances are selected
greedily based on the model's estimated class membership probabilities so that
the batch selected from the window has the highest probability of having a
balanced class membership.
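As a rough illustration of this selection scheme, the following Python sketch computes the class-weighted vote entropy and then greedily assembles a class-balanced batch from a high-disagreement window; the committee votes, the weights b_j, the binary-class setting, and the particular greedy balancing rule are illustrative assumptions rather than the exact procedure of Tomanek and Hahn [18].

import numpy as np

def weighted_vote_entropy(votes, b):
    """Class-weighted vote entropy: -sum_j b_j (V(k_j)/|C|) log(V(k_j)/|C|).
    votes[j] is V(k_j), the number of committee votes for class k_j, and
    b[j] is the importance weight of class k_j."""
    committee_size = sum(votes)  # each committee member casts one vote
    score = 0.0
    for v, w in zip(votes, b):
        p = v / committee_size
        if p > 0:
            score -= w * p * np.log(p)
    return score

def select_balanced_batch(window, p_pos, batch_size):
    """Greedily pick examples from the high-disagreement window W so that the
    batch's estimated class membership stays as balanced as possible.
    p_pos[i] is the model's estimated probability that window[i] is positive."""
    remaining = set(range(len(window)))
    batch, n_pos, n_neg = [], 0, 0
    while remaining and len(batch) < batch_size:
        want_pos = n_pos <= n_neg  # class currently under-represented in the batch
        # Take the remaining example most likely to belong to that class.
        i = max(remaining, key=lambda k: p_pos[k] if want_pos else 1.0 - p_pos[k])
        remaining.remove(i)
        batch.append(window[i])
        if p_pos[i] >= 0.5:
            n_pos += 1
        else:
            n_neg += 1
    return batch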
SVM-based AL has been shown [19] to be a highly effective strategy for
addressing class imbalance without any skew-specific modifications to the algo-
rithm. Bloodgood and Shanker [20] extend the benefits of SVM-based AL by
proposing an approach that incorporates class-specific costs. That is, the typical
C factor describing an SVM's misclassification penalty is broken up into C_+ and
C_−, describing the costs associated with misclassification of positive and negative
examples, respectively, a common approach for improving the performance of
SVMs in cost-sensitive settings. Additionally, cost-sensitive SVMs are known to
yield predictive advantages in imbalanced settings by offering some preference to
an otherwise overlooked class, often using the heuristic for setting class-specific
costs: C_+/C_− = |{x | x ∈ −}| / |{x | x ∈ +}|, a ratio in inverse proportion to the
number of examples in each class. However, in the AL setting, the true class
ratio is unknown, and the quantity C_+/C_− must be estimated by the AL system.
Bloodgood and Shanker show that it is advantageous to use a preliminary stage
of random selection in order to establish some estimate of the class ratio, and
then proceed with example selection according to the uncertainty-based “simple
margin” criterion using the appropriately tuned cost-sensitive SVM.
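A minimal sketch of one round of this procedure is given below, assuming a linear SVM from scikit-learn; the use of class_weight to realize the C_+/C_− split, the helper that estimates the ratio from a randomly selected seed set, and the batch size are illustrative assumptions, not the authors' exact implementation.

import numpy as np
from sklearn.svm import SVC

def estimate_cost_ratio(y_seed):
    """Set C_+/C_- in inverse proportion to the class counts observed in the
    preliminary, randomly selected (and labeled) seed set."""
    n_pos = np.sum(y_seed == 1)
    n_neg = np.sum(y_seed == 0)
    return n_neg / max(n_pos, 1)

def simple_margin_query(clf, X_unlabeled, batch_size=1):
    """Uncertainty sampling: return indices of the unlabeled examples that lie
    closest to the current separating hyperplane."""
    dist = np.abs(clf.decision_function(X_unlabeled))
    return np.argsort(dist)[:batch_size]

def cost_sensitive_al_round(X_labeled, y_labeled, X_unlabeled, batch_size=5):
    ratio = estimate_cost_ratio(y_labeled)  # from the random-selection stage
    # class_weight scales the misclassification penalty per class, so positive
    # errors are penalized ratio times more heavily than negative errors.
    clf = SVC(kernel="linear", C=1.0, class_weight={1: ratio, 0: 1.0})
    clf.fit(X_labeled, y_labeled)
    return simple_margin_query(clf, X_unlabeled, batch_size)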
AL has also been studied as a way to improve the generalization performance
of resampling strategies that address class imbalance. In these settings, AL is
used to choose a set of instances for labeling, with sampling strategies used to
improve the class distribution. Ertekin [21] presented the virtual instance resampling
technique using active learning (VIRTUAL), a hybrid method of oversampling
and AL that forms an adaptive technique for resampling of the minority class
instances. The learner selects the most informative example x_i for oversampling,
and the algorithm creates a synthetic instance along the direction from x_i to
one of its k nearest neighbors. The algorithm works in an online manner and builds the classi-
fier incrementally without the need to retrain on the entire labeled dataset after
creating a new synthetic example. This approach, which we present in detail in
Section 6.4, yields an efficient and scalable learning framework.
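The synthetic-instance step can be sketched as the SMOTE-style interpolation below, in which a new minority example is placed on the segment between x_i and one of its k nearest minority-class neighbors; the value of k, the use of scikit-learn's NearestNeighbors, and the uniform interpolation are assumptions for illustration, with the full method presented in Section 6.4.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def synthesize_instance(x_i, X_minority, k=5, rng=None):
    """Create a synthetic minority-class example along the direction from x_i
    to one of its k nearest neighbors within the minority class."""
    if rng is None:
        rng = np.random.default_rng()
    # Query k + 1 neighbors in case x_i itself is contained in X_minority.
    nn = NearestNeighbors(n_neighbors=min(k + 1, len(X_minority))).fit(X_minority)
    _, idx = nn.kneighbors(x_i.reshape(1, -1))
    candidates = [j for j in idx[0] if not np.allclose(X_minority[j], x_i)]
    neighbor = X_minority[rng.choice(candidates)]
    gap = rng.random()                        # random position along the segment
    return x_i + gap * (neighbor - x_i)       # the new synthetic instance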