over-sampling techniques in the bioinformatics domain. Liu et al. propose two
methods, EasyEnsemble and BalanceCascade [23], that build balanced training
sets by choosing an equal number of majority and minority class instances from
the original training data. Hido and Kashima [24] introduce "Roughly Balanced
Bagging" (RB bagging), a variant of bagging that alters the bootstrap sampling
to emphasize the minority class.
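To make the balanced-subset idea concrete, the following minimal Python sketch (using NumPy and scikit-learn) trains one base learner per balanced subset and averages their predictions. It assumes a binary problem with the minority class labeled 1; the function names are illustrative, and Liu et al.'s actual algorithms differ in detail (EasyEnsemble, for instance, trains an AdaBoost ensemble on each subset).

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def balanced_subset_ensemble(X, y, n_subsets=10, random_state=0):
    # Sketch of the EasyEnsemble-style idea: each base learner sees all
    # minority instances plus an equally sized random majority sample.
    rng = np.random.default_rng(random_state)
    min_idx = np.flatnonzero(y == 1)   # assumed minority class
    maj_idx = np.flatnonzero(y == 0)   # assumed majority class
    models = []
    for _ in range(n_subsets):
        sub_maj = rng.choice(maj_idx, size=min_idx.size, replace=False)
        idx = np.concatenate([min_idx, sub_maj])
        models.append(DecisionTreeClassifier(random_state=random_state)
                      .fit(X[idx], y[idx]))
    return models

def ensemble_proba(models, X):
    # Average the base learners' probability estimates for the minority class.
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)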
Finally, Hoens and Chawla [25] propose RSM+SMOTE, a method that combines
random subspaces with SMOTE. Specifically, they note that SMOTE depends on
the nearest neighbors of an instance to generate synthetic instances. By
applying SMOTE in different randomly chosen feature subspaces (and thereby
altering the nearest neighbor calculation SMOTE uses to create synthetic
instances), the training data for each base learner acquires a different bias,
promoting a more diverse, and therefore more effective, ensemble.
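A minimal sketch of this idea follows, assuming imbalanced-learn's SMOTE implementation and scikit-learn base learners; the helper names and parameter choices are illustrative rather than Hoens and Chawla's exact construction. Because SMOTE runs inside each random subspace, each base learner's synthetic minority instances, and hence its bias, differ.

import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier

def rsm_smote_ensemble(X, y, n_learners=10, subspace_frac=0.5, random_state=0):
    rng = np.random.default_rng(random_state)
    n_features = X.shape[1]
    k = max(2, int(subspace_frac * n_features))
    ensemble = []
    for i in range(n_learners):
        feats = rng.choice(n_features, size=k, replace=False)
        # Nearest neighbors, and hence the synthetic instances, are
        # computed in the chosen subspace only.
        X_sub, y_sub = SMOTE(random_state=i).fit_resample(X[:, feats], y)
        clf = DecisionTreeClassifier(random_state=i).fit(X_sub, y_sub)
        ensemble.append((feats, clf))
    return ensemble

def rsm_smote_proba(ensemble, X):
    # Each learner predicts from its own feature subspace.
    return np.mean([clf.predict_proba(X[:, feats])[:, 1]
                    for feats, clf in ensemble], axis=0)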
3.2.5 Drawbacks of Sampling Techniques
One major drawback of sampling techniques is that one needs to determine how
much sampling to apply. An over-sampling level must be chosen so as to promote
the minority class, while avoiding overfitting to the given data. Similarly, an
under-sampling level must be chosen so as to retain as much information about
the majority class as possible, while promoting a balanced class distribution.
In general, wrapper methods are used to solve this problem. In wrapper meth-
ods, the training data is split into a training set and a validation set. For a variety
of sampling levels, classifiers are learned on the training set. The performance
of each learned model is then evaluated on the validation set. The sampling
level that provides the best validation performance is then used to sample the
entire dataset.
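The following sketch shows one such wrapper, assuming a binary problem, imbalanced-learn's RandomOverSampler, and validation AUC as the selection criterion; the candidate levels and the choice of base learner are illustrative, and each candidate level must exceed the dataset's original minority-to-majority ratio.

import numpy as np
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

def pick_oversampling_level(X, y, levels=(0.25, 0.5, 0.75, 1.0),
                            random_state=0):
    # Hold out a stratified validation set once.
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=random_state)
    best_level, best_auc = None, -np.inf
    for level in levels:
        # sampling_strategy is the desired minority/majority ratio after
        # resampling; it must exceed the dataset's original ratio.
        X_s, y_s = RandomOverSampler(
            sampling_strategy=level,
            random_state=random_state).fit_resample(X_tr, y_tr)
        clf = DecisionTreeClassifier(random_state=random_state).fit(X_s, y_s)
        auc = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
        if auc > best_auc:
            best_level, best_auc = level, auc
    return best_level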
For hybrid techniques, such wrappers become very complicated, as instead of
having to optimize a single over- (or under-) sampling level, one has to optimize
a combination of over- and under-sampling levels. As demonstrated by Cieslak
et al. [26], such wrapper techniques are often less effective at combating class
imbalance than ensembles built of skew-insensitive classifiers. As a result, we
now turn our focus to skew-insensitive classifiers.
3.3 SKEW-INSENSITIVE CLASSIFIERS FOR CLASS IMBALANCE
While sampling methods—and ensemble methods based on sampling meth-
ods—have become the de facto standard for learning in datasets that exhibit
class imbalance, methods have also been developed that aim to directly com-
bat class imbalance without the need for sampling. These methods come mainly
from the cost-sensitive learning community; however, classifiers that deal with
imbalance are not necessarily cost-sensitive learners.
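As a simple illustration of cost-sensitive reweighting (a common approach, not a method specific to this chapter), many scikit-learn classifiers accept a class_weight parameter; setting it to "balanced" weights each class inversely to its frequency, so minority-class errors are penalized more during training without any resampling.

from sklearn.tree import DecisionTreeClassifier

# "balanced" reweights classes inversely to their frequencies, making the
# learner pay more for minority-class mistakes; no resampling is involved.
clf = DecisionTreeClassifier(class_weight="balanced")
# clf.fit(X_train, y_train) would then train a cost-sensitive tree.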