oversample the minority class, while Tomek and ENN, respectively, are used to
under-sample the majority class.
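For concreteness, the following is a minimal sketch of such hybrid samplers using the imbalanced-learn library; the toy dataset and parameter choices are illustrative and not taken from the original text.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek, SMOTEENN  # hybrid over/under-samplers

# Toy imbalanced dataset (90% majority, 10% minority) for illustration only.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# SMOTE + Tomek links: oversample the minority class, then remove Tomek pairs.
X_st, y_st = SMOTETomek(random_state=0).fit_resample(X, y)

# SMOTE + ENN: oversample, then clean with Edited Nearest Neighbours.
X_se, y_se = SMOTEENN(random_state=0).fit_resample(X, y)

print(Counter(y), Counter(y_st), Counter(y_se))
```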
3.2.4 Ensemble-Based Methods
One popular approach toward improving performance for classification problems
is to use ensembles. Ensemble methods aim to leverage the classification power
of multiple base learners (learned on different subsets of the training data) to
improve on the classification performance over traditional classification algo-
rithms. Dietterich [13] provides a broad overview as to why ensemble methods
often outperform a single classifier. In fact, Hansen and Salamon [14] prove
that, under certain constraints (the average error rate is less than 50% and the
classifiers make errors independently of one another), the expected error rate of
the ensemble on a given instance goes to zero as the number of classifiers goes
to infinity. Thus, when building multiple classifiers, it is more important that
they be diverse than that each one be individually highly accurate.
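To make this argument concrete, the following illustrative computation (not from the source; it assumes a hypothetical 35% base error rate and exact independence among classifiers) shows how quickly the majority-vote error shrinks as the ensemble grows.

```python
from math import comb

def majority_vote_error(n_classifiers, base_error):
    """Probability that a majority of n independent classifiers, each with
    error rate base_error, misclassify a given instance."""
    # The ensemble errs only when more than half of its members err.
    threshold = n_classifiers // 2 + 1
    return sum(
        comb(n_classifiers, k) * base_error ** k * (1 - base_error) ** (n_classifiers - k)
        for k in range(threshold, n_classifiers + 1)
    )

# With a hypothetical 35% base error rate, the majority-vote error shrinks rapidly:
for n in (1, 11, 51, 101):
    print(n, round(majority_vote_error(n, 0.35), 6))
```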
There are many popular methods for building diverse ensembles, including
bagging [15], AdaBoost [16], Random Subspaces [17], and Random Forests
[18]. While each of these ensemble methods can be applied to datasets that have
already undergone sampling, doing so is generally suboptimal because it ignores
the benefit of combining the ensemble-generation method and the sampling strategy
into a single, more structured approach. As a result, many ensemble methods have
been combined with sampling strategies to produce ensembles that are better suited
to dealing with class imbalance.
AdaBoost is one of the most popular ensemble methods in the machine learn-
ing community due, in part, to its attractive theoretical guarantees [16]. As a
result of its popularity, AdaBoost has undergone extensive empirical research
[13, 19]. Recall that in AdaBoost, each base classifier L is learned on a subset S_L
of the training data D, where each instance in S_L is selected probabilistically on
the basis of its weight in D. After each classifier is trained, every instance's
weight is adaptively updated on the basis of how that classifier performed on the
instance. By giving more weight to misclassified instances, the ensemble is able
to focus on instances that are difficult to learn.
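A minimal sketch of this reweighting step is given below; the function name, the clipping of the error term, and the binary-label setting are illustrative assumptions rather than part of the original algorithm statement.

```python
import numpy as np

def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost-style reweighting step: misclassified instances receive
    larger weights, correctly classified instances receive smaller ones."""
    miss = y_true != y_pred
    # Weighted error of the current base classifier (clipped to avoid log(0)).
    err = np.clip(weights[miss].sum() / weights.sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)          # vote weight of this classifier
    new_weights = weights * np.exp(alpha * np.where(miss, 1.0, -1.0))
    return new_weights / new_weights.sum(), alpha  # renormalize to a distribution
```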
SMOTEBoost is one example of combining sampling methods with AdaBoost
to create an ensemble that explicitly aims to overcome the class imbalance [20].
In SMOTEBoost, in addition to updating instance weights during each boosting
iteration, SMOTE is also applied to misclassified minority class examples. Thus,
in addition to emphasizing minority instances by giving higher weights, misclas-
sified minority instances are also emphasized by the addition of (similar) synthetic
examples. Similar to SMOTEBoost, Guo and Viktor [21] develop another exten-
sion for boosting called DataBoost-IM, which identifies hard instances (both
minority and majority) in order to generate similar synthetic examples and then
reweights the instances to prevent a bias toward the majority class.
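The following is a rough, simplified sketch of such a boosting-plus-SMOTE loop; it is not the published SMOTEBoost or DataBoost-IM algorithm, and it assumes NumPy arrays, binary labels, and the SMOTE implementation from imbalanced-learn.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE  # any SMOTE implementation would do

def smoteboost_sketch(X, y, n_rounds=10, seed=0):
    """Simplified SMOTEBoost-style loop: in each round, draw a weight-proportional
    sample of the data, add synthetic minority examples with SMOTE, train a weak
    learner, and then reweight the original instances as in AdaBoost."""
    rng = np.random.default_rng(seed)
    n = len(y)
    weights = np.full(n, 1.0 / n)
    learners, alphas = [], []
    for _ in range(n_rounds):
        idx = rng.choice(n, size=n, p=weights)                # weight-proportional sample of D
        # Assumes the sample keeps enough minority examples for SMOTE's k neighbors.
        X_res, y_res = SMOTE().fit_resample(X[idx], y[idx])   # add synthetic minority examples
        stump = DecisionTreeClassifier(max_depth=1).fit(X_res, y_res)
        miss = stump.predict(X) != y                          # evaluate on the original data
        err = np.clip(weights[miss].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        weights *= np.exp(alpha * np.where(miss, 1.0, -1.0))
        weights /= weights.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas
```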
An alternative to AdaBoost is bagging, another ensemble method that has
been adapted to use sampling. Radivojac et al. [22] combine bagging with