as costly as false positives by assigning appropriate costs, by increasing the
ratio of positive to negative examples in the training set by a factor of 2, or
by raising the probability threshold for predicting the majority class from
one-half to two-thirds (equivalently, lowering the threshold for the rare class
to one-third). Unfortunately, as implemented in real-world situations, these
equivalences do not hold.
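Before examining why these equivalences break down in practice, it helps to see
them side by side. The sketch below shows the three nominally equivalent ways
of making false negatives twice as costly as false positives; the use of
scikit-learn's LogisticRegression and a synthetic dataset are illustrative
assumptions, not choices made in the text.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01],
                               random_state=0)

    # (1) Cost-sensitive learning: penalize errors on the rare (positive)
    # class twice as heavily as errors on the majority class.
    clf_cost = LogisticRegression(class_weight={0: 1, 1: 2}).fit(X, y)

    # (2) Resampling: double the ratio of positive to negative examples
    # by duplicating every positive example once.
    pos = np.where(y == 1)[0]
    idx = np.concatenate([np.arange(len(y)), pos])
    clf_samp = LogisticRegression().fit(X[idx], y[idx])

    # (3) Threshold moving: train on the original data, but predict the
    # majority class only when p(negative) >= 2/3, i.e., predict the rare
    # class whenever p(positive) >= 1/3.
    clf = LogisticRegression().fit(X, y)
    y_pred = (clf.predict_proba(X)[:, 1] >= 1 / 3).astype(int)

In theory, all three classifiers encode the same 2 : 1 cost ratio; the concrete
example that follows shows why, in practice, the resampling variant behaves
differently.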
As a concrete example, suppose that a training set has 10,000 examples and
a class distribution of 100 : 1, so that there are only 100 positive examples. One
way to improve the identification of the rare class is to impose a greater cost
for false negatives than for false positives. A cost ratio of 100 : 1 is theoretically
equivalent to modifying the training distribution, so that it is balanced, with a
1 : 1 class ratio. To generate such a balanced distribution in practice, one would
typically oversample the minority class or undersample the majority class, or do
both. But if one undersamples the majority class, then potentially valuable data
is thrown away, and if one oversamples the minority class, then one is making
exact copies of examples, which can lead to overfitting. For the equivalence to
hold, one should randomly select new minority class examples from the original
distribution, which would include examples that are not already available for
training. But this is almost never feasible. Even generating new, synthetic
minority class examples violates the equivalence, as these examples will, at
best, only be a better approximation of the true distribution. Thus, in
practice, sampling methods are not equivalent to other methods for dealing
with imbalanced data, and they have drawbacks that properly implemented
alternatives, such as cost-sensitive learning, do not.
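The helper functions below sketch the sampling strategies just described,
assuming a feature matrix X and binary labels y with 1 marking the minority
class; the simplified, SMOTE-like interpolation (random minority pairs rather
than nearest neighbors) is an illustrative stand-in for generating synthetic
examples.

    import numpy as np

    rng = np.random.default_rng(0)

    def oversample_exact(X, y, n_new):
        # Duplicate randomly chosen minority examples; the exact copies
        # can lead to overfitting.
        pos = np.where(y == 1)[0]
        picks = rng.choice(pos, size=n_new, replace=True)
        return (np.vstack([X, X[picks]]),
                np.concatenate([y, np.ones(n_new, dtype=int)]))

    def undersample_majority(X, y, n_keep):
        # Keep all minority examples but only n_keep majority examples;
        # potentially valuable data is thrown away.
        keep = np.concatenate([np.where(y == 1)[0],
                               rng.choice(np.where(y == 0)[0],
                                          size=n_keep, replace=False)])
        return X[keep], y[keep]

    def synthesize_minority(X, y, n_new):
        # Interpolate between random pairs of minority examples; the new
        # points are, at best, an approximation of the true distribution.
        pos = X[y == 1]
        a = pos[rng.integers(len(pos), size=n_new)]
        b = pos[rng.integers(len(pos), size=n_new)]
        lam = rng.random((n_new, 1))
        return (np.vstack([X, a + lam * (b - a)]),
                np.concatenate([y, np.ones(n_new, dtype=int)]))

None of these manipulations draws genuinely new minority examples from the
original distribution, which is why the equivalence discussed above fails in
practice.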
Another significant concern with sampling is that its impact is often not fully
understood, or even considered. Increasing the proportion of examples belonging
to the rare class has two distinct effects. First, it will help address the
problems with relative rarity and, if the examples are new examples, will also
address the problem with absolute rarity by injecting new knowledge. However,
if no corrective action is taken, it will also have a second effect: it will
impose nonuniform error costs, causing the learner to be biased in favor of
predicting the rare class. In many situations, this second effect is desired
and is actually the main reason for altering the class distribution of the
training data. But in other cases,
namely when new examples are added (e.g., via active learning), this effect is
not desirable. That is, in these other cases, the intent is to improve performance
with respect to the rare class by having more data available for that class, not
by biasing the data-mining algorithm toward that class. In these cases, this bias
should be removed.
The bias introduced toward predicting the oversampled class can be removed
using the equivalences noted earlier to account for the differences between the
training distribution and the underlying distribution [4, 43]. For example, the
bias can be removed by adjusting the decision thresholds, as was done in one
study that demonstrated the positive impact of removing this unintended bias [4].
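One way to carry out such a threshold adjustment is to correct each predicted
posterior for the difference between the training prior and the natural prior
by rescaling the odds, as sketched below; the function name and the prior
arguments pi_train and pi_true are assumptions for this sketch, not notation
taken from [4] or [43].

    def correct_posterior(p_train, pi_train, pi_true):
        # Convert p(positive | x), estimated under the (resampled) training
        # prior pi_train, into an estimate under the natural prior pi_true
        # by scaling the odds by the ratio of the two priors.
        k = (pi_true * (1 - pi_train)) / (pi_train * (1 - pi_true))
        odds = k * p_train / (1 - p_train)
        return odds / (1 + odds)

    # Example: a classifier trained on a balanced (1 : 1) sample predicts
    # 0.70 for the rare class, but the natural distribution is 100 : 1.
    p = correct_posterior(0.70, pi_train=0.5, pi_true=1 / 101)
    # p is now roughly 0.023, so the usual one-half threshold no longer
    # over-predicts the rare class.

Equivalently, one can leave the predicted probabilities unchanged and shift the
decision threshold itself by the same odds factor.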
That study showed that adding new examples to alter the class distribution of
the training data, so that it deviates from the natural, underlying
distribution, improved classifier performance. However, classifier performance
was improved only after the bias introduced by the altered distribution was
removed.