over-sampling techniques in the bioinformatics domain. Liu et al. propose two
methods, EasyEnsemble and BalanceCascade [23], that build balanced training
sets by choosing an equal number of majority and minority class instances from
the original training data. Hido and Kashima [24] introduce "Roughly Balanced
Bagging" (RB bagging), a variant of bagging that alters the bootstrap sampling
to emphasize the minority class.
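To make the balanced-subset idea concrete, the following minimal Python sketch (using NumPy and scikit-learn) trains one base learner per balanced subset and averages their predictions. It assumes a binary problem with the minority class labeled 1; the function names are illustrative, and Liu et al.'s actual algorithms differ in detail (EasyEnsemble, for instance, trains an AdaBoost ensemble on each subset).

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def balanced_subset_ensemble(X, y, n_subsets=10, random_state=0):
    # Sketch of the EasyEnsemble-style idea: each base learner sees all
    # minority instances plus an equally sized random majority sample.
    rng = np.random.default_rng(random_state)
    min_idx = np.flatnonzero(y == 1)   # assumed minority class
    maj_idx = np.flatnonzero(y == 0)   # assumed majority class
    models = []
    for _ in range(n_subsets):
        sub_maj = rng.choice(maj_idx, size=min_idx.size, replace=False)
        idx = np.concatenate([min_idx, sub_maj])
        models.append(DecisionTreeClassifier(random_state=random_state)
                      .fit(X[idx], y[idx]))
    return models

def ensemble_proba(models, X):
    # Average the base learners' probability estimates for the minority class.
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)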
Finally, Hoens and Chawla [25] propose RSM+SMOTE, a method that combines
random subspaces with SMOTE. Specifically, they note that SMOTE depends on
the nearest neighbors of an instance to generate synthetic instances. By
applying SMOTE in different randomly chosen feature subspaces (and thereby
altering the nearest neighbor calculation SMOTE uses to create synthetic
instances), the training data for each base learner acquires a different bias,
promoting a more diverse, and therefore more effective, ensemble.
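A minimal sketch of this idea follows, assuming imbalanced-learn's SMOTE implementation and scikit-learn base learners; the helper names and parameter choices are illustrative rather than Hoens and Chawla's exact construction. Because SMOTE runs inside each random subspace, each base learner's synthetic minority instances, and hence its bias, differ.

import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier

def rsm_smote_ensemble(X, y, n_learners=10, subspace_frac=0.5, random_state=0):
    rng = np.random.default_rng(random_state)
    n_features = X.shape[1]
    k = max(2, int(subspace_frac * n_features))
    ensemble = []
    for i in range(n_learners):
        feats = rng.choice(n_features, size=k, replace=False)
        # Nearest neighbors, and hence the synthetic instances, are
        # computed in the chosen subspace only.
        X_sub, y_sub = SMOTE(random_state=i).fit_resample(X[:, feats], y)
        clf = DecisionTreeClassifier(random_state=i).fit(X_sub, y_sub)
        ensemble.append((feats, clf))
    return ensemble

def rsm_smote_proba(ensemble, X):
    # Each learner predicts from its own feature subspace.
    return np.mean([clf.predict_proba(X[:, feats])[:, 1]
                    for feats, clf in ensemble], axis=0)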
3.2.5 Drawbacks of Sampling Techniques
One major drawback of sampling techniques is that one needs to determine how
much sampling to apply. An over-sampling level must be chosen so as to promote
the minority class, while avoiding overfitting to the given data. Similarly, an
under-sampling level must be chosen so as to retain as much information about
the majority class as possible, while promoting a balanced class distribution.
In general, wrapper methods are used to solve this problem. In wrapper meth-
ods, the training data is split into a training set and a validation set. For a variety
of sampling levels, classifiers are learned on the training set. The performance
of each learned model is then evaluated on the validation set. The sampling
level that provides the best validation performance is then used to sample the
entire dataset.
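The following sketch shows one such wrapper, assuming a binary problem, imbalanced-learn's RandomOverSampler, and validation AUC as the selection criterion; the candidate levels and the choice of base learner are illustrative, and each candidate level must exceed the dataset's original minority-to-majority ratio.

import numpy as np
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

def pick_oversampling_level(X, y, levels=(0.25, 0.5, 0.75, 1.0),
                            random_state=0):
    # Hold out a stratified validation set once.
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=random_state)
    best_level, best_auc = None, -np.inf
    for level in levels:
        # sampling_strategy is the desired minority/majority ratio after
        # resampling; it must exceed the dataset's original ratio.
        X_s, y_s = RandomOverSampler(
            sampling_strategy=level,
            random_state=random_state).fit_resample(X_tr, y_tr)
        clf = DecisionTreeClassifier(random_state=random_state).fit(X_s, y_s)
        auc = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
        if auc > best_auc:
            best_level, best_auc = level, auc
    return best_level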
For hybrid techniques, such wrappers become very complicated, as instead of
having to optimize a single over- (or under-) sampling level, one has to optimize
a combination of over- and under-sampling levels. As demonstrated by Cieslak
et al. [26], such wrapper techniques are often less effective at combating class
imbalance than ensembles built of skew-insensitive classifiers. As a result, we
now turn our focus to skew-insensitive classifiers.
3.3 SKEW-INSENSITIVE CLASSIFIERS FOR CLASS IMBALANCE
While sampling methods—and ensemble methods based on sampling meth-
ods—have become the de facto standard for learning in datasets that exhibit
class imbalance, methods have also been developed that aim to directly com-
bat class imbalance without the need for sampling. These methods come mainly
from the cost-sensitive learning community; however, classifiers that deal with
imbalance are not necessarily cost-sensitive learners.
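As a simple illustration of cost-sensitive reweighting (a common approach, not a method specific to this chapter), many scikit-learn classifiers accept a class_weight parameter; setting it to "balanced" weights each class inversely to its frequency, so minority-class errors are penalized more during training without any resampling.

from sklearn.tree import DecisionTreeClassifier

# "balanced" reweights classes inversely to their frequencies, making the
# learner pay more for minority-class mistakes; no resampling is involved.
clf = DecisionTreeClassifier(class_weight="balanced")
# clf.fit(X_train, y_train) would then train a cost-sensitive tree.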