similar) distribution to those originally in the dataset. Note that sampling meth-
ods need not create an exactly balanced distribution, merely a distribution that
traditional classifiers are better able to handle.
Two of the first sampling methods developed were random under-sampling
and random over-sampling. In random under-sampling, majority class
instances are discarded at random until a more balanced distribution is reached.
Consider, for example, a dataset consisting of 10 minority class instances and
100 majority class instances. In random under-sampling, one might attempt to
create a balanced class distribution by selecting 90 majority class instances at
random to be removed. The resulting dataset will then consist of 20 instances:
10 (randomly remaining) majority class instances and (the original) 10 minority
class instances.
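To make the procedure concrete, the following Python sketch reproduces the example above. The list-based data representation and the helper name random_undersample are illustrative choices, not taken from any particular library.

```python
import random

def random_undersample(majority, minority, seed=0):
    """Randomly discard majority instances until the classes are balanced."""
    rng = random.Random(seed)
    # Keep only as many majority instances as there are minority instances.
    kept_majority = rng.sample(majority, k=len(minority))
    return kept_majority + minority

# 100 majority and 10 minority instances, as in the example above.
majority = [("maj", i) for i in range(100)]
minority = [("min", i) for i in range(10)]

balanced = random_undersample(majority, minority)
print(len(balanced))  # 20 instances: 10 majority + 10 minority
```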
Alternatively, in random over-sampling, minority class instances are copied
and repeated in the dataset until a more balanced distribution is reached. Thus, if
there are two minority class instances and 100 majority class instances, random
over-sampling would copy the two minority class instances 49 times each. The
resulting dataset would then consist of 200 instances: the 100 majority class
instances and 100 minority class instances (i.e., 50 copies each of the two minority
class instances).
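A corresponding sketch of random over-sampling is given below; again the representation and the helper name random_oversample are illustrative. The sketch draws replacement copies at random, which for the example above (2 minority and 100 majority instances) yields roughly 50 copies of each minority instance.

```python
import random

def random_oversample(majority, minority, seed=0):
    """Replicate minority instances, drawn at random with replacement,
    until the class distribution is balanced."""
    rng = random.Random(seed)
    n_extra = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return majority + minority + extra

# 100 majority and 2 minority instances, as in the example above.
majority = [("maj", i) for i in range(100)]
minority = [("min", i) for i in range(2)]

balanced = random_oversample(majority, minority)
print(len(balanced))  # 200 instances: 100 majority, 100 minority
```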
While random under-sampling and random over-sampling create more bal-
anced distributions, they both suffer from serious drawbacks. In random
under-sampling, potentially vast quantities of data are discarded. In the
under-sampling example mentioned above, roughly 82% of the data (the 90
majority class instances) was discarded. This can be highly
problematic, as the loss of such data can make the decision boundary between
minority and majority instances harder to learn, resulting in a loss in classification
performance.
Alternatively, in random over-sampling, instances are repeated (sometimes
to very high degrees). Consider the random over-sampling example mentioned
above, where each instance had to be replicated 49 times in order to balance out
the class distribution. By copying instances in this way, one can cause drastic
overfitting to occur in the classifier, making the generalization performance of
the classifier exceptionally poor. The potential for overfitting grows as the
class imbalance ratio worsens and each instance must be replicated more and
more often.
In order to overcome these limitations, more sophisticated sampling techniques
have been developed. We now describe some of these techniques.
3.2.1 Under-Sampling Techniques
The major drawback of random under-sampling is that potentially useful infor-
mation can be discarded when samples are chosen randomly. In order to combat
this, various techniques have been developed that aim to retain all useful
information present in the majority class by removing redundant, noisy, and/or
borderline instances from the dataset. Redundant instances are considered safe
to remove as they, by definition, do not add any information about the majority
class.
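As one concrete illustration of removing redundant majority instances, the sketch below applies a single pass of Hart's condensed nearest neighbor rule, a common building block of informed under-sampling methods. The two-dimensional points and the Euclidean distance are assumptions made for the example, and practical implementations typically iterate until the retained set stabilizes.

```python
import math

def nearest_label(store, x):
    """Return the label of the stored point closest to x (1-NN)."""
    best = min(store, key=lambda p: math.dist(p[0], x))
    return best[1]

def condense(majority, minority):
    """Keep all minority points; keep a majority point only if the
    current store would misclassify it (i.e., it is not redundant)."""
    store = [(x, "min") for x in minority] + [(majority[0], "maj")]
    for x in majority[1:]:
        if nearest_label(store, x) != "maj":
            store.append((x, "maj"))
    return store

# Tightly clustered majority points are largely redundant to one another.
minority = [(0.0, 0.0), (0.2, 0.1)]
majority = [(1.0 + 0.01 * i, 1.0) for i in range(100)]
print(len(condense(majority, minority)))  # far fewer than 102 points
```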