There are a few subtle points concerning class imbalance. First, class imbal-
ance must be defined with respect to a particular dataset or distribution. Since
class labels are required in order to determine the degree of class imbalance,
class imbalance is typically gauged with respect to the training distribution. If
the training distribution is representative of the underlying distribution, as is
often assumed, then there is no problem; but if this is not the case, then we
cannot conclude that the underlying distribution is imbalanced. But the situation
can be complicated by the fact that when dealing with class imbalance, a com-
mon strategy is to artificially balance the training set. In this case, do we have
class imbalance or not? The answer in this case is “yes” — we still do have class
imbalance. That is, when discussing the problems associated with class imbal-
ance, we really care about the underlying distribution. Artificially balancing the
training distribution may help with the effects of class imbalance, but does not
remove the underlying problem.
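To make the balancing strategy concrete, here is a minimal sketch of random oversampling, one common way to artificially balance a training set. The function name and the list-based representation are illustrative, not from the text; the point is that only the training distribution changes, while the underlying distribution the classifier faces at test time stays imbalanced.

```python
import random

def oversample_minority(X, y, seed=0):
    """Randomly duplicate minority-class examples until both classes
    have equal counts in the training set.

    X: list of feature vectors; y: list of binary labels (0 or 1).
    Note: this only rebalances the *training* data; it does not change
    the underlying distribution that causes the imbalance problem.
    """
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    # Sample (with replacement) enough minority indices to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]
```

After this step the training set is balanced, yet a classifier trained on it will still be evaluated against the original, imbalanced distribution, which is exactly the distinction the paragraph above draws.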
A second point concerns the fact that while class imbalance literally refers to
the relative proportions of examples belonging to each class, the absolute number
of examples available for learning is clearly very important. Thus, the class imbal-
ance problem for a dataset with 10,000 positive examples and 1,000,000 negative
examples is clearly quite different from a dataset with 10 positive examples and
1000 negative examples — even though the class proportions are identical. These
two problems can be referred to as problems with relative rarity and absolute
rarity. A dataset may suffer from neither of these problems, one of these prob-
lems, or both of these problems. We discuss the issue of absolute rarity in the
context of class imbalance because highly imbalanced datasets very often have
problems with absolute rarity.
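The distinction between relative and absolute rarity can be checked with simple arithmetic. The helper below is an illustrative sketch (the function name is invented for this example); applied to the two datasets from the text, it shows that their minority-class proportions are identical while their absolute minority counts differ by three orders of magnitude.

```python
def imbalance_summary(n_pos, n_neg):
    """Summarize a two-class dataset: the minority-class proportion
    captures relative rarity; the raw minority count captures
    absolute rarity."""
    minority = min(n_pos, n_neg)
    return {
        "minority_fraction": minority / (n_pos + n_neg),
        "minority_count": minority,
    }

# The two datasets from the text: same ~1% minority proportion,
# very different absolute rarity (10,000 vs. 10 minority examples).
large = imbalance_summary(10_000, 1_000_000)
small = imbalance_summary(10, 1_000)
```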
2.2.2 Between-Class Imbalance, Rare Cases, and Small Disjuncts
Thus far we have been discussing class imbalance, or, as it has been termed,
between-class imbalance. A second type of imbalance, which is not quite as
well known or extensively studied, is within-class imbalance [5, 6]. Within-class
imbalance is the result of rare cases [7] in the true, but generally unknown,
classification concept to be learned. More specifically, rare cases correspond
to sub-concepts in the induced classifier that covers relatively few cases. For
example, in a medical dataset containing patient data where each patient is labeled
as “sick” or “healthy,” a rare case might correspond to those sick patients suffer-
ing from botulism, a relatively rare illness. In this domain, within-class imbalance
occurs within the “sick” class because of the presence of much more general
cases, such as those corresponding to the common cold. Just as the minority
class in an imbalanced dataset is very hard to learn well, the rare cases are also
hard to learn, even if they are part of the majority class. This difficulty is much
harder to measure than the difficulty of learning the rare class, because rare cases
can only be defined with respect to the classification concept, which, for real-world
problems, is unknown and can only be approximated. However, the difficulty of
learning rare cases can be measured using artificial datasets that are generated