oversample the minority class, while Tomek and ENN, respectively, are used to
under-sample the majority class.
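For concreteness, the following is a minimal sketch of such hybrid samplers using the imbalanced-learn library; the toy dataset and parameter choices are illustrative and not taken from the original text.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek, SMOTEENN  # hybrid over/under-samplers

# Toy imbalanced dataset (90% majority, 10% minority) for illustration only.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# SMOTE + Tomek links: oversample the minority class, then remove Tomek pairs.
X_st, y_st = SMOTETomek(random_state=0).fit_resample(X, y)

# SMOTE + ENN: oversample, then clean with Edited Nearest Neighbours.
X_se, y_se = SMOTEENN(random_state=0).fit_resample(X, y)

print(Counter(y), Counter(y_st), Counter(y_se))
```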
3.2.4 Ensemble-Based Methods
One popular approach toward improving performance for classification problems
is to use ensembles. Ensemble methods aim to leverage the classification power
of multiple base learners (learned on different subsets of the training data) to
improve on the classification performance over traditional classification algo-
rithms. Dietterich [13] provides a broad overview as to why ensemble methods
often outperform a single classifier. In fact, Hansen and Salamon [14] prove
that, under certain constraints (the average error rate is less than 50% and the
classifiers make errors independently of one another), the expected error rate of
the ensemble on a given instance goes to zero as the number of classifiers goes
to infinity. Thus, when building multiple classifiers, it is more important that
they be diverse than that each one be individually highly accurate.
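To make this argument concrete, the following illustrative computation (not from the source; it assumes a hypothetical 35% base error rate and exact independence among classifiers) shows how quickly the majority-vote error shrinks as the ensemble grows.

```python
from math import comb

def majority_vote_error(n_classifiers, base_error):
    """Probability that a majority of n independent classifiers, each with
    error rate base_error, misclassify a given instance."""
    # The ensemble errs only when more than half of its members err.
    threshold = n_classifiers // 2 + 1
    return sum(
        comb(n_classifiers, k) * base_error ** k * (1 - base_error) ** (n_classifiers - k)
        for k in range(threshold, n_classifiers + 1)
    )

# With a hypothetical 35% base error rate, the majority-vote error shrinks rapidly:
for n in (1, 11, 51, 101):
    print(n, round(majority_vote_error(n, 0.35), 6))
```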
There are many popular methods for building diverse ensembles, including
bagging [15], AdaBoost [16], Random Subspaces [17], and Random Forests
[18]. While each of these ensemble methods can be applied to datasets that have
already undergone sampling, doing so is generally suboptimal because it ignores
the benefit of combining the ensemble-generation method and the sampling strategy
into a single, more structured approach. As a result, many ensemble methods have
been combined with sampling strategies to produce ensembles that are better suited
to dealing with class imbalance.
AdaBoost is one of the most popular ensemble methods in the machine learn-
ing community due, in part, to its attractive theoretical guarantees [16]. As a
result of its popularity, AdaBoost has undergone extensive empirical research
[13, 19]. Recall that in AdaBoost, each base classifier L is learned on a subset S_L
of the training data D, where each instance in S_L is selected probabilistically on
the basis of its weight in D. After each classifier is trained, every instance's
weight is adaptively updated on the basis of how that classifier performed on the
instance. By giving more weight to misclassified instances, the ensemble is able
to focus on instances that are difficult to learn.
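A minimal sketch of this reweighting step is given below; the function name, the clipping of the error term, and the binary-label setting are illustrative assumptions rather than part of the original algorithm statement.

```python
import numpy as np

def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost-style reweighting step: misclassified instances receive
    larger weights, correctly classified instances receive smaller ones."""
    miss = y_true != y_pred
    # Weighted error of the current base classifier (clipped to avoid log(0)).
    err = np.clip(weights[miss].sum() / weights.sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)          # vote weight of this classifier
    new_weights = weights * np.exp(alpha * np.where(miss, 1.0, -1.0))
    return new_weights / new_weights.sum(), alpha  # renormalize to a distribution
```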
SMOTEBoost is one example of combining sampling methods with AdaBoost
to create an ensemble that explicitly aims to overcome the class imbalance [20].
In SMOTEBoost, in addition to updating instance weights during each boosting
iteration, SMOTE is also applied to misclassified minority class examples. Thus,
in addition to emphasizing minority instances by giving higher weights, misclas-
sified minority instances are also emphasized by the addition of (similar) synthetic
examples. Similar to SMOTEBoost, Guo and Viktor [21] develop another exten-
sion for boosting called DataBoost-IM, which identifies hard instances (both
minority and majority) in order to generate similar synthetic examples and then
reweights the instances to prevent a bias toward the majority class.
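The following is a rough, simplified sketch of such a boosting-plus-SMOTE loop; it is not the published SMOTEBoost or DataBoost-IM algorithm, and it assumes NumPy arrays, binary labels, and the SMOTE implementation from imbalanced-learn.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE  # any SMOTE implementation would do

def smoteboost_sketch(X, y, n_rounds=10, seed=0):
    """Simplified SMOTEBoost-style loop: in each round, draw a weight-proportional
    sample of the data, add synthetic minority examples with SMOTE, train a weak
    learner, and then reweight the original instances as in AdaBoost."""
    rng = np.random.default_rng(seed)
    n = len(y)
    weights = np.full(n, 1.0 / n)
    learners, alphas = [], []
    for _ in range(n_rounds):
        idx = rng.choice(n, size=n, p=weights)                # weight-proportional sample of D
        # Assumes the sample keeps enough minority examples for SMOTE's k neighbors.
        X_res, y_res = SMOTE().fit_resample(X[idx], y[idx])   # add synthetic minority examples
        stump = DecisionTreeClassifier(max_depth=1).fit(X_res, y_res)
        miss = stump.predict(X) != y                          # evaluate on the original data
        err = np.clip(weights[miss].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        weights *= np.exp(alpha * np.where(miss, 1.0, -1.0))
        weights /= weights.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas
```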
An alternative to AdaBoost is bagging, another ensemble method that has
been adapted to use sampling. Radivojac et al. [22] combine bagging with