6.4 ADAPTIVE RESAMPLING WITH ACTIVE LEARNING
The analysis in Section 6.3.5 shows the effectiveness of AL on imbalanced
datasets without employing any resampling techniques. This section extends the
discussion on the effectiveness of AL for imbalanced data classification and
demonstrates that even in cases where resampling is the preferred approach, AL
can still be used to significantly improve the classification performance.
In supervised learning, a common strategy to overcome the rarity problem is
to resample the original dataset to decrease the overall level of class imbalance.
Resampling is done by oversampling the minority (positive) class, under-sampling
the majority (negative) class, or both, until the classes are approximately
equally represented [28, 30-32]. Oversampling, in its simplest form, achieves a
more balanced class distribution either by duplicating minority class instances or
introducing new synthetic instances that belong to the minority class [30]. No
information is lost in oversampling as all original instances of the minority and
the majority classes are retained in the oversampled dataset. The other strategy
to reduce the class imbalance is under-sampling, which eliminates some majority
class instances mostly by RS.
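As an illustration, the two strategies can be sketched in their simplest (random) form. The function names and toy data below are illustrative only and are not taken from the methods cited above:

```python
import random

def random_oversample(majority, minority, seed=0):
    """Duplicate randomly chosen minority instances until both classes
    are equally represented; no original instance is discarded."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra

def random_undersample(majority, minority, seed=0):
    """Keep a random subset of the majority class of the same size as the
    minority class; the discarded instances are lost to the learner."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority

# Toy imbalanced dataset: 100 negative versus 10 positive instances.
majority = [(float(i), -1) for i in range(100)]
minority = [(float(i), +1) for i in range(10)]

maj, mino = random_oversample(majority, minority)   # 100 vs. 100 instances
maj, mino = random_undersample(majority, minority)  # 10 vs. 10 instances
```

The sketch makes the trade-off discussed next concrete: oversampling retains all information but enlarges the training set, while under-sampling shrinks it at the cost of discarding majority class instances.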
Even though both approaches address the class imbalance problem, they also
suffer from some drawbacks. The under-sampling strategy can sacrifice the
prediction performance of the model, since it may discard informative instances
from which the learner could benefit. The oversampling strategy, on the other
hand, can be computationally overwhelming for large training sets: if a complex
oversampling method is used, a large computational effort must be expended
during preprocessing of the data. Worse, oversampling causes longer
training time during the learning process because of the increased number of
training instances. In addition to suffering from increased runtime due to added
computational complexity, it also necessitates an increased memory footprint due
to the extra storage requirements of artificial instances. Other costs associated
with the learning process (e.g., the extended kernel matrix in kernel-based
classification algorithms) further increase the burden of oversampling.
6.4.1 VIRTUAL: Virtual Instance Resampling Technique Using
Active Learning
This section focuses on the oversampling strategy for imbalanced data
classification and investigates how it can benefit from the principles of AL. Our
goal is to remedy the efficiency drawbacks of oversampling in imbalanced data
classification and use an AL strategy to generate minority class instances only if
they can be useful to the learner. VIRTUAL (virtual instance resampling technique
using active learning) [22] is a hybrid method of oversampling and AL that forms
an adaptive technique for resampling minority class instances. In contrast to
traditional oversampling techniques, which generate virtual instances of the
minority class in an offline step before the training process, VIRTUAL leverages
the power of AL to intelligently and adaptively oversample the data during training,