2.2.1 What is an Imbalanced Dataset and What is Its Impact on Learning?
We begin with a discussion of the most fundamental question: “What is meant by
imbalanced data and imbalanced learning?” Initially, we focus on classification
problems, and in this context, learning from imbalanced data means learning from
data in which the classes have unequal numbers of examples. Because virtually
no datasets are perfectly balanced, however, this is not a very useful definition.
There is no agreement, or standard, concerning the exact degree of class imbalance
required for a dataset to be considered truly “imbalanced.” Most practitioners
would nonetheless agree that a dataset in which the most common class is less than
twice as common as the rarest class is only marginally imbalanced, that
datasets with an imbalance ratio of about 10 : 1 are modestly imbalanced,
and that datasets with imbalance ratios above 1000 : 1 are extremely imbalanced.
Ultimately, what we care about is how the imbalance impacts learning
and, in particular, the ability to learn the rare classes.
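As a rough illustration of the imbalance ratios quoted above, the following Python sketch computes one from a list of class labels. The function name imbalance_ratio and the example labels are hypothetical. The ratio is taken here as the count of the most common class divided by the count of the rarest class, which is one reasonable reading of the text, not a standard definition.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return count(most common class) / count(rarest class).

    Assumption: "imbalance ratio" means most-common to rarest, as in
    the informal usage of the surrounding text.
    """
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to measure imbalance")
    return max(counts.values()) / min(counts.values())


# Hypothetical dataset: 900 negative and 100 positive examples.
labels = ["neg"] * 900 + ["pos"] * 100
ratio = imbalance_ratio(labels)
print(f"imbalance ratio = {ratio:.0f} : 1")
```

By the informal scale above, a dataset like this hypothetical one would count as roughly modestly imbalanced.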
Learning performance provides us with an empirical, and objective, means
for determining what should be considered an imbalanced dataset. Figure 2.1,
generated from data in an earlier study that analyzed 26 binary-class datasets [4],
shows how class imbalance impacts minority class classification performance.
Specifically, it shows that the ratio between the minority class and the majority
class error rates is greatest for the most highly imbalanced datasets and decreases
as the amount of class imbalance decreases. Figure 2.1 clearly demonstrates that
class imbalance leads to poorer performance when classifying minority class
examples, as the error rate ratios are above 1.0. This impact is quite
severe: datasets with class imbalances between 5 : 1 and 10 : 1 have a minority
class error rate more than 10 times the error rate on the majority class.
The impact even appears quite significant for class imbalances between 1 : 1 and
3 : 1, which indicates that class imbalance is problematic in more situations than
commonly acknowledged. This suggests that we should consider datasets with
even moderate levels of class imbalance (e.g., 2 : 1) as “suffering” from class
imbalance.
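The quantity discussed above, the ratio between the minority class and majority class error rates, can be sketched in Python as follows. This is a minimal illustration, not the procedure used in the cited study [4]. The function names and the toy labels and predictions are hypothetical, and each class's error rate is assumed to be the fraction of that class's examples that are misclassified.

```python
def class_error_rate(y_true, y_pred, cls):
    """Fraction of examples whose true class is `cls` that are misclassified."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t == cls]
    if not pairs:
        raise ValueError(f"no examples of class {cls!r}")
    return sum(t != p for t, p in pairs) / len(pairs)


def error_rate_ratio(y_true, y_pred, minority, majority):
    """Minority class error rate divided by majority class error rate."""
    min_err = class_error_rate(y_true, y_pred, minority)
    maj_err = class_error_rate(y_true, y_pred, majority)
    if maj_err == 0:
        raise ZeroDivisionError("majority class error rate is zero")
    return min_err / maj_err


# Hypothetical 8 : 1 dataset: 100 minority (1) and 800 majority (0) examples.
y_true = [1] * 100 + [0] * 800
# Hypothetical classifier output: 50 errors on each class.
y_pred = [0] * 50 + [1] * 50 + [1] * 50 + [0] * 750

ratio = error_rate_ratio(y_true, y_pred, minority=1, majority=0)
print(f"minority/majority error rate ratio = {ratio:.1f}")
```

A ratio above 1.0 means the classifier makes proportionally more mistakes on minority class examples than on majority class examples, which is the pattern the text attributes to class imbalance.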
Figure 2.1 Impact of class imbalance on minority class performance.
[Figure omitted. Horizontal axis: class imbalance, binned as >10 : 1, 5 : 1-10 : 1, 3 : 1-5 : 1, and 1 : 1-3 : 1. Vertical axis: scale from 0 to 25.]