Zhu and Hovy [22] describe a bootstrap-based oversampling strategy (BootOS)
that, given an example to be resampled, generates a bootstrap example based
on all k neighbors of that example. At each epoch, the examples with the
greatest uncertainty are selected for labeling and incorporated into the
labeled set L. The proposed oversampling strategy is then applied to L,
yielding a more balanced dataset L', which is used to retrain the base model.
Selecting the examples with the highest uncertainty for labeling at each
iteration involves resampling the labeled examples and training a new
classifier on the resampled dataset; the scalability of this approach may
therefore be a concern for large-scale datasets.
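As a rough illustration of this idea (not Zhu and Hovy's exact formulation),
the sketch below assumes that each bootstrap example is formed by averaging a
with-replacement sample of an original example's k nearest minority neighbors;
the function name and parameters are hypothetical.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def bootstrap_oversample(X_minority, k=5, n_new=100, seed=None):
        # Fit a neighbor index on the minority examples; the first neighbor
        # returned for each point is the point itself, so request k + 1.
        rng = np.random.default_rng(seed)
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
        _, indices = nn.kneighbors(X_minority)
        new_examples = []
        for _ in range(n_new):
            i = rng.integers(len(X_minority))          # pick an example to resample
            neighbors = X_minority[indices[i, 1:]]     # its k nearest minority neighbors
            boot = neighbors[rng.integers(k, size=k)]  # bootstrap (with-replacement) sample
            new_examples.append(boot.mean(axis=0))     # combine into one synthetic example
        return np.vstack(new_examples)

The exact rule for combining the resampled neighbors into a new example is an
assumption here; the point of the sketch is only the bootstrap-from-neighbors
structure of the oversampler.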
In the next section, we demonstrate that the principles of AL are naturally
suited to address the class imbalance problem and that AL can in fact be an
effective strategy for obtaining a balanced view of an otherwise imbalanced
dataset, without the need to resort to resampling techniques. It is worth
noting that the goal of the next section is not to cast AL as a replacement
for resampling strategies. Rather, our main goal is to demonstrate how AL can
alleviate the issues that stem from class imbalance and to present AL as an
alternative technique that should be considered when a resampling approach is
impractical, inefficient, or ineffective. In problems where resampling is the
preferred solution, we show in Section 6.4 that the benefits of AL can still
be leveraged to address class imbalance. In particular, we present an adaptive
oversampling technique that uses AL to determine which examples to resample in
an online setting. These two approaches illustrate the versatility of AL and
the importance of selective sampling in addressing the class imbalance
problem.
6.3 ACTIVE LEARNING FOR IMBALANCED DATA CLASSIFICATION
As outlined in Section 6.2.1, AL is primarily considered a technique to reduce
the number of training samples that need to be labeled for a classification
task. From a traditional perspective, the active learner has access to a vast
pool of unlabeled examples and aims to make a clever choice, selecting the
most informative example and requesting its label. However, even when the
labels of the training data are already available, AL can still be leveraged
to identify the informative examples within the training set [23-25]. For
example, in large-margin
classifiers such as SVM, the informativeness of an example is synonymous with
its distance to the hyperplane. The farther an example is from the hyperplane,
the more confident the learner is about its true class label; hence there is little, if any,
benefit that the learner can gain by asking for the label of that example. On the
other hand, the examples close to the hyperplane are the ones that yield the most
information to the learner. Therefore, the most commonly used AL strategy in
SVMs is to check the distance of each unlabeled example to the hyperplane and
focus on the examples that lie closest to the hyperplane, as they are considered
to be the most informative examples to the learner [8].
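The following is a minimal sketch of this selection step, assuming a linear
SVM trained with scikit-learn, whose decision_function returns signed margins;
the helper name and batch size are illustrative rather than part of the
strategy described in [8].

    import numpy as np
    from sklearn.svm import LinearSVC

    def select_closest_to_hyperplane(model, X_unlabeled, batch_size=10):
        # decision_function returns the signed margin of each example; its
        # absolute value is proportional to the distance from the hyperplane,
        # so the smallest values identify the most uncertain examples.
        margins = np.abs(model.decision_function(X_unlabeled))
        return np.argsort(margins)[:batch_size]

    # Illustrative loop: train on the labeled pool, query the examples that
    # lie closest to the hyperplane, label them, and retrain.
    # model = LinearSVC().fit(X_labeled, y_labeled)
    # query_idx = select_closest_to_hyperplane(model, X_pool, batch_size=20)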