There are a few subtle points concerning class imbalance. First, class imbal-
ance must be defined with respect to a particular dataset or distribution. Since
class labels are required in order to determine the degree of class imbalance,
class imbalance is typically gauged with respect to the training distribution. If
the training distribution is representative of the underlying distribution, as is
often assumed, then there is no problem; but if this is not the case, then we
cannot conclude that the underlying distribution is imbalanced. But the situation
can be complicated by the fact that when dealing with class imbalance, a com-
mon strategy is to artificially balance the training set. In this case, do we have
class imbalance or not? The answer in this case is “yes” — we still do have class
imbalance. That is, when discussing the problems associated with class imbal-
ance, we really care about the underlying distribution. Artificially balancing the
training distribution may help with the effects of class imbalance, but does not
remove the underlying problem.
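To make the balancing strategy concrete, here is a minimal sketch of random oversampling, one common way to artificially balance a training set. The function name and the list-based representation are illustrative, not from the text; the point is that only the training distribution changes, while the underlying distribution the classifier faces at test time stays imbalanced.

```python
import random

def oversample_minority(X, y, seed=0):
    """Randomly duplicate minority-class examples until both classes
    have equal counts in the training set.

    X: list of feature vectors; y: list of binary labels (0 or 1).
    Note: this only rebalances the *training* data; it does not change
    the underlying distribution that causes the imbalance problem.
    """
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    # Sample (with replacement) enough minority indices to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]
```

After this step the training set is balanced, yet a classifier trained on it will still be evaluated against the original, imbalanced distribution, which is exactly the distinction the paragraph above draws.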
A second point concerns the fact that while class imbalance literally refers to
the relative proportions of examples belonging to each class, the absolute number
of examples available for learning is clearly very important. Thus, the class imbal-
ance problem for a dataset with 10,000 positive examples and 1,000,000 negative
examples is clearly quite different from a dataset with 10 positive examples and
1000 negative examples — even though the class proportions are identical. These
two problems can be referred to as problems with relative rarity and absolute
rarity. A dataset may suffer from neither of these problems, one of these prob-
lems, or both of these problems. We discuss the issue of absolute rarity in the
context of class imbalance because highly imbalanced datasets very often have
problems with absolute rarity.
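The distinction between relative and absolute rarity can be checked with simple arithmetic. The helper below is an illustrative sketch (the function name is invented for this example); applied to the two datasets from the text, it shows that their minority-class proportions are identical while their absolute minority counts differ by three orders of magnitude.

```python
def imbalance_summary(n_pos, n_neg):
    """Summarize a two-class dataset: the minority-class proportion
    captures relative rarity; the raw minority count captures
    absolute rarity."""
    minority = min(n_pos, n_neg)
    return {
        "minority_fraction": minority / (n_pos + n_neg),
        "minority_count": minority,
    }

# The two datasets from the text: same ~1% minority proportion,
# very different absolute rarity (10,000 vs. 10 minority examples).
large = imbalance_summary(10_000, 1_000_000)
small = imbalance_summary(10, 1_000)
```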
2.2.2 Between-Class Imbalance, Rare Cases, and Small Disjuncts
Thus far we have been discussing class imbalance, or, as it has been termed,
between-class imbalance. A second type of imbalance, which is not quite as
well known or extensively studied, is within-class imbalance [5, 6]. Within-class
imbalance is the result of rare cases [7] in the true, but generally unknown,
classification concept to be learned. More specifically, rare cases correspond
to sub-concepts in the induced classifier that covers relatively few cases. For
example, in a medical dataset containing patient data where each patient is labeled
as “sick” or “healthy,” a rare case might correspond to those sick patients suffer-
ing from botulism, a relatively rare illness. In this domain, within-class imbalance
occurs within the “sick” class because of the presence of much more general
cases, such as those corresponding to the common cold. Just as the minority
class in an imbalanced dataset is very hard to learn well, the rare cases are also
hard to learn, even if they are part of the majority class. This difficulty is much
harder to measure than the difficulty of learning the rare class, because rare cases
can only be defined with respect to the classification concept, which, for real-world
problems, is unknown and can only be approximated. However, the difficulty of
learning rare cases can be measured using artificial datasets that are generated