outnumbers the other (called the minority, or positive, class). The class imbalance
problem arises when this positive class is the class of interest.
One obvious complication that arises in the class imbalance problem is the
ineffectiveness of accuracy (and error rate) as measures of classifier performance.
Consider, for example, a dataset in which the majority class represents 99% of
the data and the minority class represents 1% (such a dataset is said to have an
imbalance ratio of 99 : 1). In this case, the naïve classifier, which always predicts
the majority class, achieves an accuracy of 99%. Similarly, if a dataset has an
imbalance ratio of 9999 : 1, the majority classifier achieves an accuracy of 99.99%.
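This effect is easy to reproduce. The following sketch (using NumPy; the 99 : 1 toy labels are fabricated purely for illustration) computes the accuracy of the naïve majority-class classifier:

```python
import numpy as np

# Hypothetical dataset with a 99 : 1 imbalance ratio: 9,900 majority
# (negative) instances and 100 minority (positive) instances.
y_true = np.array([0] * 9900 + [1] * 100)

# The naive classifier always predicts the majority class (label 0).
y_pred = np.zeros_like(y_true)

accuracy = (y_true == y_pred).mean()
print(accuracy)  # 0.99 -- high accuracy, yet not one minority instance is found
```

Despite an accuracy of 0.99, the model identifies none of the positive instances.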
One consequence of this limitation can be seen in the performance of most
traditional classifiers when they are applied to class-imbalanced data. Because
most traditional classifiers optimize accuracy, they tend to generate a model
equivalent to the naïve model described previously. Such a classifier, in spite of
its high accuracy, is useless in most practical applications, as the minority class
is often the class of interest (otherwise a classifier would hardly be necessary,
since the class of interest would almost always occur). As a result, numerous
methods have been developed to overcome the class imbalance problem. These
methods fall into two general categories: sampling methods and skew-insensitive
classifiers.
Sampling methods (e.g., random over-sampling and random under-sampling)
have become standard approaches for improving classification performance [1].
In sampling methods, the training set is altered in such a way as to create a more
balanced class distribution. The resulting sampled dataset is then more amenable
to traditional data-mining algorithms, which can then be used to classify the data.
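As an illustration, random over-sampling duplicates randomly chosen minority instances, while random under-sampling discards randomly chosen majority instances, until the class distribution is balanced. A minimal sketch of both (function names and the 90 : 10 toy data are our own, not from the chapter):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X, y, minority_label=1):
    """Duplicate randomly chosen minority instances until the classes balance."""
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx]

def random_undersample(X, y, minority_label=1):
    """Discard randomly chosen majority instances until the classes balance."""
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    keep = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([keep, minority])
    return X[idx], y[idx]

X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)            # 90 : 10 imbalance
X_over, y_over = random_oversample(X, y)     # 90 instances of each class
X_under, y_under = random_undersample(X, y)  # 10 instances of each class
```

Note the trade-off implicit in the two approaches: over-sampling retains all of the original data but repeats minority instances, whereas under-sampling discards potentially useful majority instances.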
Alternatively, methods have been developed to combat the class imbalance
problem directly. These skew-insensitive classifiers are specifically designed to
optimize a metric other than accuracy, one better suited to the class imbalance
problem, and are thereby able to generate more informative models.
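The specific metrics are discussed later in the chapter; as one illustrative example of a skew-insensitive metric (our choice, not necessarily the chapter's), balanced accuracy averages the per-class recalls, so the naïve majority classifier from the earlier example scores only 0.5 rather than 0.99:

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Average of per-class recalls; a constant classifier scores 1/n_classes."""
    recalls = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    return float(np.mean(recalls))

y_true = np.array([0] * 99 + [1] * 1)  # 99 : 1 toy labels
naive = np.zeros_like(y_true)          # always predict the majority class

print(balanced_accuracy(y_true, naive))  # 0.5, despite 99% plain accuracy
```

A classifier optimizing such a metric cannot reach a high score by ignoring the minority class, which is precisely the behavior the naïve model exhibits under plain accuracy.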
In this chapter, we discuss the various approaches to overcoming class imbalance,
as well as the metrics that can be used to evaluate them.
3.2 SAMPLING METHODS
Sampling is a popular methodology for countering the problem of class imbalance.
The goal of sampling methods is to create a dataset with a relatively balanced
class distribution, so that traditional classifiers are better able to capture the
decision boundary between the majority and minority classes. Since sampling
methods are used to make the classification of minority class instances easier,
the resulting (sampled) dataset should represent a “reasonable” approximation
of the original dataset. Specifically, the resulting dataset should contain only
instances that are, in some sense, similar to those in the original dataset; that
is, all instances in the modified dataset should be drawn from the same (or