among the earliest successful examples of active machine learning techniques [3,
4]. The intuition behind uncertainty-based selection is that the region surrounding a model's decision boundary is where that model is most likely to make mistakes. Incorporating labeled examples from this region may improve the model's performance along this boundary, leading to gains in overall accuracy.
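As a concrete illustration (not drawn from the cited references), a minimal least-confidence uncertainty-sampling step might look like the following sketch; it assumes a scikit-learn-style classifier exposing predict_proba, and the names model, X_pool, and batch_size are purely illustrative:

```python
import numpy as np

def uncertainty_select(model, X_pool, batch_size=10):
    """Select the pooled examples the current model is least certain about.

    Uncertainty is measured here as 1 minus the maximum predicted class
    probability, so examples lying near the decision boundary score highest.
    """
    probas = model.predict_proba(X_pool)          # shape: (n_pool, n_classes)
    uncertainty = 1.0 - probas.max(axis=1)        # least-confidence score
    return np.argsort(-uncertainty)[:batch_size]  # indices of most uncertain examples
```

The selected indices would then be passed to an oracle for labeling, and the model retrained on the enlarged training set before the next selection round.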
Many popular subsequent techniques are specializations of uncertainty selec-
tion, including query-by-committee-based approaches [5-7], where, given an
ensemble of (valid) predictive models, examples are selected based on the level
of disagreement elicited among the ensemble, and the popular “simple margin”
technique proposed by Tong and Koller [8], where, given a current parameter-
ization of a support vector machine (SVM), w_j, the example x_i is chosen that comes closest to the decision boundary, x_i = argmin_x |w_j · Φ(x)|, where Φ(·) is a function mapping an example to an alternate space utilized by the kernel function in the SVM: k(u, v) = Φ(u) · Φ(v).
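A minimal sketch of this simple-margin selection is given below, assuming a fitted scikit-learn SVC (or any binary classifier exposing decision_function) trained on the currently labeled data; the names svm and X_pool are illustrative:

```python
import numpy as np

def simple_margin_select(svm, X_pool):
    """Return the index of the unlabeled example closest to the SVM boundary.

    For a fitted binary SVC, |decision_function(x)| grows with the distance of
    x from the separating hyperplane in the kernel-induced feature space, so
    the smallest absolute value marks the example nearest the boundary.
    """
    margins = np.abs(svm.decision_function(X_pool))  # distance-like scores
    return int(np.argmin(margins))                    # closest-to-boundary example
```

An analogous score can be built for query-by-committee approaches by replacing the margin with a measure of disagreement (e.g., vote entropy) across the ensemble.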
Expected-utility-based approaches choose examples based on the estimated expected improvement in a certain objective that would be achieved by incorporating a given example into the training set. 2 Such techniques often involve costly nested cross-validation where each available example is assigned all possible label states [9-11].
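To make the cost of such schemes concrete, the following deliberately simplified sketch retrains the model once per candidate example and label state; it assumes a clone-able scikit-learn-style classifier and a user-supplied utility function (e.g., accuracy on a held-out set), and it omits the nested cross-validation that full implementations typically employ:

```python
import numpy as np
from sklearn.base import clone

def expected_utility_select(model, X_train, y_train, X_pool, utility):
    """Pick the pooled example whose labeling maximizes estimated expected utility.

    Each candidate example is assigned every possible label state in turn,
    the model is retrained, and the resulting utilities are averaged using
    the current model's label probabilities as weights.
    """
    label_probas = model.predict_proba(X_pool)   # P(y | x) under the current model
    best_idx, best_score = None, -np.inf
    for i, x in enumerate(X_pool):
        score = 0.0
        for j, y in enumerate(model.classes_):
            # Hypothetically add (x, y) to the training set and retrain.
            X_aug = np.vstack([X_train, x.reshape(1, -1)])
            y_aug = np.append(y_train, y)
            candidate = clone(model).fit(X_aug, y_aug)
            score += label_probas[i, j] * utility(candidate)
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx
```

As noted in footnote 2, the utility callback need not measure the base model's use-time objective; it could, for instance, estimate the reduction in problem uncertainty instead.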
6.2.2 Dealing with the Class Imbalance Problem in Active Learning
Selecting examples from an unlabeled pool with substantial class imbalance may
pose several difficulties for traditional AL. The greater proportion of examples
in the majority class may lead to a model that prefers one class over another.
If the labels of examples selected by an AL scheme are thought of as a random
variable, the innate class imbalance in the example pool would almost certainly
lead to a preference for majority examples in the training set. Unless properly
dealt with, 3 this over-representation may simply lead to a predictive preference
for the majority class when labeling. Typically, when making predictive models
in an imbalanced setting, it is the minority class that is of interest. For instance,
it is important to discover patients who have a rare but dangerous ailment based
on the results of a blood test, or infrequent but costly fraud in a credit card
company's transaction history. This difference in class preferences between an
end system's needs and a model's tendencies causes a serious problem for AL
(and predictive systems in general) in imbalanced settings. Even if the problem of a highly imbalanced (although correct in terms of base rate) training set can be dealt with, the tendency for a selection algorithm to gather majority
examples creates other problems. The nuances of the minority set may be poorly
represented in the training data, leading to a “predictive misunderstanding” in
2 Note that this selection objective may not necessarily be the same objective used during the base
model's use time. For instance, examples may be selected according to their contribution to the
reduction in problem uncertainty.
3 For instance, by imbalanced learning techniques described throughout this topic.