6.2.3.2 Skew-Specialized Active Learning
Additionally, there exists a body
of research literature on AL designed specifically to deal with the class imbalance problem.
Tomanek and Hahn [18] investigate query-by-committee-based approaches to
sampling labeled sentences for the task of named entity recognition. The goal of
their selection strategy is to encourage class-balanced selections by incorporating
class-specific costs. Unlabeled instances are ordered by a class-weighted,
entropy-based disagreement measure,

    −Σ_{j∈{0,1}} b_j (V(k_j)/|C|) log(V(k_j)/|C|),
where V(k_j) is the number of votes from a committee of size |C| that an
instance belongs to class k_j, and b_j is a weight corresponding to the importance
of including a certain class; a larger value of b_j corresponds to an increased
tendency to include examples that are thought to belong to that class. From
a window W of examples with the highest disagreement, instances are selected
greedily based on the model's estimated class membership probabilities so that
the batch selected from the window has the highest probability of having a
balanced class membership.
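The disagreement measure above can be sketched in a few lines of code. This is an illustrative implementation, not Tomanek and Hahn's own; the function name and argument layout are assumptions, but the computation follows the formula: each class's vote fraction contributes its weighted entropy term.

```python
import math

def weighted_vote_entropy(votes, weights, committee_size):
    """Class-weighted vote-entropy disagreement for one unlabeled instance.

    votes[j]   -- number of committee votes V(k_j) for class k_j
    weights[j] -- class-importance weight b_j (larger favors that class)
    committee_size -- |C|, the number of committee members
    """
    score = 0.0
    for j, v in enumerate(votes):
        p = v / committee_size          # V(k_j) / |C|
        if p > 0:                       # treat 0 * log 0 as 0
            score -= weights[j] * p * math.log(p)
    return score
```

With uniform weights, an evenly split committee (maximum disagreement) scores log 2, while a unanimous committee scores 0; raising b_j for the minority class boosts the rank of instances the committee suspects belong to it.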
SVM-based AL has been shown [19] to be a highly effective strategy for
addressing class imbalance without any skew-specific modifications to the algorithm.
Bloodgood and Shanker [20] extend the benefits of SVM-based AL by
proposing an approach that incorporates class-specific costs. That is, the typical
C factor describing an SVM's misclassification penalty is broken up into C+
and C−, describing the costs associated with misclassification of positive and negative
examples, respectively, a common approach for improving the performance of
SVMs in cost-sensitive settings. Additionally, cost-sensitive SVMs are known to
yield predictive advantages in imbalanced settings by offering some preference to
an otherwise overlooked class, often using the heuristic for setting class-specific
costs

    C+/C− = |{x | x ∈ −}| / |{x | x ∈ +}|,

a ratio in inverse proportion to the
number of examples in each class. However, in the AL setting, the true class
ratio is unknown, and the quantity C+/C− must be estimated by the AL system.
Bloodgood and Shanker show that it is advantageous to use a preliminary stage
of random selection in order to establish some estimate of the class ratio, and
then proceed with example selection according to the uncertainty-based "simple
margin" criterion using the appropriately tuned cost-sensitive SVM.
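The two-stage recipe can be illustrated with a small helper that turns the labels from the preliminary random-selection stage into class-specific penalties. The function name and the base_C parameter are hypothetical conveniences; the ratio itself is the inverse-class-ratio heuristic from the text.

```python
def estimate_costs(seed_labels, base_C=1.0):
    """Set class-specific SVM costs from a preliminary randomly
    labeled seed set, using the inverse-class-ratio heuristic
    C+/C- = |negatives| / |positives|.

    seed_labels -- labels (+1 / -1) gathered in the random stage
    Returns (C_plus, C_minus).
    """
    pos = sum(1 for y in seed_labels if y == 1)
    neg = len(seed_labels) - pos
    if pos == 0:
        raise ValueError("seed set contains no positive examples")
    return base_C * neg / pos, base_C
```

In practice these two values would be passed to a cost-sensitive SVM (for example, via a per-class weight option in an SVM library) before continuing with simple-margin selection; the rarer the positive class in the seed sample, the larger its misclassification penalty.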
AL has also been studied as a way to improve the generalization performance
of resampling strategies that address class imbalance. In these settings, AL is
used to choose a set of instances for labeling, with sampling strategies used to
improve the class distribution. Ertekin [21] presented the virtual instance resampling
technique using active learning (VIRTUAL), a hybrid method of oversampling
and AL that forms an adaptive technique for resampling of the minority class
instances. The learner selects the most informative example x_i for oversampling,
and the algorithm creates a synthetic instance along the direction from x_i to
one of its k nearest neighbors. The algorithm works in an online manner and builds the classifier
incrementally without the need to retrain on the entire labeled dataset after
creating a new synthetic example. This approach, which we present in detail in
Section 6.4, yields an efficient and scalable learning framework.
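The oversampling step described above, interpolating between an informative instance and one of its k neighbors, can be sketched as follows. This is a minimal SMOTE-style interpolation sketch, not Ertekin's full VIRTUAL algorithm (which also handles the online model update); the function name is an assumption.

```python
import random

def synthetic_instance(x_i, neighbors, rng=random):
    """Create one synthetic minority-class example along the direction
    from the informative instance x_i to a randomly chosen one of its
    k minority-class neighbors (SMOTE-style linear interpolation).

    x_i       -- feature vector chosen by the active learner
    neighbors -- list of x_i's k nearest minority-class neighbors
    """
    nb = rng.choice(neighbors)
    gap = rng.random()   # interpolation factor in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x_i, nb)]
```

Each call yields a point on the segment between x_i and the chosen neighbor, so the synthetic example stays inside the local region of the minority class rather than being a verbatim duplicate.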