examples, while oversampling increases the time required to train a classifier
and can also lead to overfitting, because the classifier is induced to cover the
duplicated training examples [31, 33].
More advanced sampling methods use some intelligence when removing or
adding examples. This can minimize the drawbacks that were just described
and, in the case of intelligently adding examples, has the potential to address
the underlying issue of absolute rarity. One undersampling strategy removes
only those majority class examples that are redundant with other examples or
that border regions containing minority class examples, on the grounds that
they may be the result of noise [34]. The synthetic minority oversampling
technique (SMOTE), on the other hand, oversamples the data by introducing new,
non-replicated minority class examples generated along the line segments that
join each minority class example to its five nearest minority class neighbors
[33]. This tends to expand the decision boundaries associated with the small
disjuncts/rare cases, as opposed to the overfitting associated with random
oversampling. Another approach is to identify a good class distribution for learning
and then generate samples with that distribution. Once this is done, multiple
training sets with the desired class distribution can be formed using all minority
class examples and a subset of the majority class examples. This can be done so
that each majority class example is guaranteed to occur in at least one training
set, so that no data is wasted. The learning algorithm is then applied to each
training set, and meta-learning is used to form a composite learner from the
resulting classifiers. This approach can be used with any learning method and
has been applied to four different learning algorithms [1]. The same basic
approach of partitioning the data and learning multiple classifiers has also
been used with support vector machines (SVMs), and an SVM ensemble has been
shown to outperform both undersampling and oversampling [35].
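
To make the SMOTE-style interpolation described above concrete, the following is a
minimal sketch in Python/NumPy. It is not the reference implementation from [33];
the function name smote, its parameters, and the choice of k = 5 neighbors are
illustrative assumptions, and the minority class examples are assumed to be supplied
as a NumPy array with more than k rows.

import numpy as np

def smote(minority_X, n_synthetic, k=5, rng=None):
    # Illustrative sketch: generate synthetic minority examples by interpolating
    # between a randomly chosen minority example and one of its k nearest
    # minority-class neighbors.
    rng = np.random.default_rng(rng)
    n, d = minority_X.shape
    # Pairwise distances computed within the minority class only.
    dists = np.linalg.norm(minority_X[:, None, :] - minority_X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]    # indices of the k nearest neighbors

    synthetic = np.empty((n_synthetic, d))
    for i in range(n_synthetic):
        base = rng.integers(n)                      # pick a minority example at random
        nbr = neighbors[base, rng.integers(k)]      # pick one of its k neighbors
        gap = rng.random()                          # random point along the segment
        synthetic[i] = minority_X[base] + gap * (minority_X[nbr] - minority_X[base])
    return synthetic

In practice, a library implementation such as the SMOTE class in the imbalanced-learn
package would normally be used instead, since it handles edge cases (e.g., fewer than
k minority examples) and integrates with standard learning pipelines.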
All of these more sophisticated methods attempt to reduce some of the draw-
backs associated with the simple random sampling methods. But for the most
part, it seems unlikely that they introduce any new knowledge and hence they
do not appear to truly address any of the underlying issues previously identified.
Rather, they at best compensate for learning algorithms that are not well suited to
dealing with class imbalance. This point is made quite clearly in the description
of the SMOTE method, when it is mentioned that the introduction of the new
examples effectively serves to change the bias of the learner, forcing a more
general bias, but only for the minority class. Theoretically, such a modification
to the bias could be implemented at the algorithm level. As discussed later, there
has been research at the algorithm level in modifying the bias of a learner to
better handle imbalanced data.
The sampling methods just described are designed to reduce between-class
imbalance. Although research indicates that reducing between-class imbalance
will also tend to reduce within-class imbalances [4], it is worth considering
whether sampling methods can be used in a more direct manner to reduce
within-class imbalances — and whether this is beneficial. This question has been
studied using artificial domains and the results indicate that it is not sufficient to
eliminate between-class imbalances (i.e., rare classes) in order to learn complex