outnumbers the other (called the minority, or positive, class). The class imbalance
problem arises when this positive class is the class of interest.
One obvious complication that arises in the class imbalance problem is the
ineffectiveness of accuracy (and error rate) as measures of classifier performance.
Consider, for example, a dataset in which the majority class represents 99% of
the data and the minority class represents 1% (such a dataset is said to have an
imbalance ratio of 99 : 1). In this case, the naïve classifier, which always predicts
the majority class, achieves an accuracy of 99%. Similarly, if a dataset has an
imbalance ratio of 9999 : 1, the majority classifier achieves an accuracy of 99.99%.
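This effect is easy to reproduce. The following sketch (using NumPy; the 99 : 1 toy labels are fabricated purely for illustration) computes the accuracy of the naïve majority-class classifier:

```python
import numpy as np

# Hypothetical dataset with a 99 : 1 imbalance ratio: 9,900 majority
# (negative) instances and 100 minority (positive) instances.
y_true = np.array([0] * 9900 + [1] * 100)

# The naive classifier always predicts the majority class (label 0).
y_pred = np.zeros_like(y_true)

accuracy = (y_true == y_pred).mean()
print(accuracy)  # 0.99 -- high accuracy, yet not one minority instance is found
```

Despite an accuracy of 0.99, the model identifies none of the positive instances.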
One consequence of this limitation can be seen in the performance of most
traditional classifiers when they are applied to class-imbalanced data. Because
most traditional classifiers optimize accuracy, they tend to generate a model
equivalent to the naïve model described previously. Such a classifier, in spite of
its high accuracy, is useless in most practical applications, as the minority class
is often the class of interest (otherwise a classifier would hardly be necessary,
since the class of interest would almost always occur). As a result, numerous
methods have been developed to overcome the class imbalance problem. These
methods fall into two general categories: sampling methods and skew-insensitive
classifiers.
Sampling methods (e.g., random over-sampling and random under-sampling)
have become standard approaches for improving classification performance [1].
In sampling methods, the training set is altered in such a way as to create a more
balanced class distribution. The resulting sampled dataset is then more amenable
to traditional data-mining algorithms, which can then be used to classify the data.
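As an illustration, random over-sampling duplicates randomly chosen minority instances, while random under-sampling discards randomly chosen majority instances, until the class distribution is balanced. A minimal sketch of both (function names and the 90 : 10 toy data are our own, not from the chapter):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X, y, minority_label=1):
    """Duplicate randomly chosen minority instances until the classes balance."""
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx]

def random_undersample(X, y, minority_label=1):
    """Discard randomly chosen majority instances until the classes balance."""
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    keep = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([keep, minority])
    return X[idx], y[idx]

X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)            # 90 : 10 imbalance
X_over, y_over = random_oversample(X, y)     # 90 instances of each class
X_under, y_under = random_undersample(X, y)  # 10 instances of each class
```

Note the trade-off implicit in the two approaches: over-sampling retains all of the original data but repeats minority instances, whereas under-sampling discards potentially useful majority instances.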
Alternatively, methods have been developed to combat the class imbalance
problem directly. These skew-insensitive classifiers are specifically designed to
optimize a metric other than accuracy, one better suited to the class imbalance
problem, and are thereby able to generate more informative models.
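The specific metrics are discussed later in the chapter; as one illustrative example of a skew-insensitive metric (our choice, not necessarily the chapter's), balanced accuracy averages the per-class recalls, so the naïve majority classifier from the earlier example scores only 0.5 rather than 0.99:

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Average of per-class recalls; a constant classifier scores 1/n_classes."""
    recalls = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    return float(np.mean(recalls))

y_true = np.array([0] * 99 + [1] * 1)  # 99 : 1 toy labels
naive = np.zeros_like(y_true)          # always predict the majority class

print(balanced_accuracy(y_true, naive))  # 0.5, despite 99% plain accuracy
```

A classifier optimizing such a metric cannot reach a high score by ignoring the minority class, which is precisely the behavior the naïve model exhibits under plain accuracy.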
In this chapter, we discuss the various approaches to overcoming class imbalance,
as well as the metrics that can be used to evaluate them.
3.2 SAMPLING METHODS
Sampling is a popular methodology for countering the problem of class imbalance.
The goal of sampling methods is to create a dataset with a relatively balanced
class distribution, so that traditional classifiers are better able to capture the
decision boundary between the majority and minority classes. Since sampling
methods are used to make the classification of minority class instances easier,
the resulting (sampled) dataset should represent a “reasonable” approximation
of the original dataset. Specifically, the resulting dataset should contain only
instances that are, in some sense, similar to those in the original dataset; that
is, all instances in the modified dataset should be drawn from the same (or