6.4 ADAPTIVE RESAMPLING WITH ACTIVE LEARNING
The analysis in Section 6.3.5 shows the effectiveness of AL on imbalanced
datasets without employing any resampling techniques. This section extends the
discussion on the effectiveness of AL for imbalanced data classification and
demonstrates that even in cases where resampling is the preferred approach, AL
can still be used to significantly improve the classification performance.
In supervised learning, a common strategy to overcome the rarity problem is
to resample the original dataset to decrease the overall level of class imbalance.
Resampling is done by oversampling the minority (positive) class, under-sampling
the majority (negative) class, or both, until the classes are approximately
equally represented [28, 30-32]. Oversampling, in its simplest form, achieves a
more balanced class distribution either by duplicating minority class instances or
introducing new synthetic instances that belong to the minority class [30]. No
information is lost in oversampling as all original instances of the minority and
the majority classes are retained in the oversampled dataset. The other strategy
to reduce the class imbalance is under-sampling, which eliminates some majority
class instances mostly by RS.
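As an illustration, the two strategies can be sketched in their simplest (random) form. The function names and toy data below are illustrative only and are not taken from the methods cited above:

```python
import random

def random_oversample(majority, minority, seed=0):
    """Duplicate randomly chosen minority instances until both classes
    are equally represented; no original instance is discarded."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra

def random_undersample(majority, minority, seed=0):
    """Keep a random subset of the majority class of the same size as the
    minority class; the discarded instances are lost to the learner."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority

# Toy imbalanced dataset: 100 negative versus 10 positive instances.
majority = [(float(i), -1) for i in range(100)]
minority = [(float(i), +1) for i in range(10)]

maj, mino = random_oversample(majority, minority)   # 100 vs. 100 instances
maj, mino = random_undersample(majority, minority)  # 10 vs. 10 instances
```

The sketch makes the trade-off discussed next concrete: oversampling retains all information but enlarges the training set, while under-sampling shrinks it at the cost of discarding majority class instances.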
Even though both approaches address the class imbalance problem, they also
suffer from some drawbacks. The under-sampling strategy can sacrifice the
prediction performance of the model, since it may discard informative instances
from which the learner could benefit. The oversampling strategy, on the other
hand, can be computationally overwhelming for large training sets: if a complex
oversampling method is used, a large computational effort must be expended
during preprocessing of the data. Worse, oversampling causes longer
training time during the learning process because of the increased number of
training instances. In addition to suffering from increased runtime due to added
computational complexity, it also necessitates an increased memory footprint due
to the extra storage requirements of artificial instances. Other costs associated
with the learning process (e.g., the extended kernel matrix in kernel-based
classification algorithms) further increase the burden of oversampling.
6.4.1 VIRTUAL: Virtual Instance Resampling Technique Using
Active Learning
This section focuses on the oversampling strategy for imbalanced data
classification and investigates how it can benefit from the principles of AL. Our
goal is to remedy the efficiency drawbacks of oversampling in imbalanced data
classification and use an AL strategy to generate minority class instances only if
they can be useful to the learner. VIRTUAL (virtual instance resampling technique
using active learning) [22] is a hybrid method of oversampling and AL that forms
an adaptive technique for resampling minority class instances. In contrast to
traditional oversampling techniques, which generate virtual instances of the
minority class in an offline step before the training process, VIRTUAL leverages
the power of AL to intelligently and adaptively oversample the data during training,