These results suggest that absolute rarity poses a very serious problem for
learning. But the problem could also be that small disjuncts sometimes do not
represent rare, or exceptional, cases, but instead represent noise. The underlying
problem, then, is that there is no easy way to distinguish between those small
disjuncts that represent rare/exceptional cases, which should be kept, and those
that represent noise, which should be discarded (i.e., pruned).
We have seen that rare cases are difficult to learn because of a lack of training
examples. It is generally assumed that rare classes are difficult to learn for similar
reasons, although, in theory, rare classes might not be disproportionately made up
of rare cases when compared to common classes. One study showed that this is most
likely not the case: across 26 datasets, the disjuncts labeled with the minority class
were much smaller than those labeled with the majority class [4]. Thus, rare classes
do tend to be made up of more rare cases (on the assumption that rare cases form
small disjuncts), and because rare cases are harder to learn than common cases, the
minority class will tend to be harder to learn than the majority class. This effect is
ultimately due to an absolute lack of training examples for the minority class.
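This relationship can be illustrated empirically. The sketch below is a minimal example, assuming scikit-learn and a synthetic imbalanced dataset (not one of the 26 datasets studied in [4]): it grows an unpruned decision tree and compares the sizes of the leaves (disjuncts) that predict the minority class with those that predict the majority class. On imbalanced data, the minority-labeled leaves will typically be much smaller.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced dataset (class 1 is the minority, roughly 5% of examples).
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.95, 0.05], random_state=0)

# Unpruned tree: each leaf corresponds to one disjunct of the learned concept.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

leaf_ids = clf.apply(X)            # leaf reached by each training example
sizes = {}                         # leaf id -> number of training examples it covers
for leaf in leaf_ids:
    sizes[leaf] = sizes.get(leaf, 0) + 1

# Class predicted by each leaf (majority class of the examples it covers).
label = {leaf: int(clf.tree_.value[leaf].argmax()) for leaf in sizes}

minority = [n for leaf, n in sizes.items() if label[leaf] == 1]
majority = [n for leaf, n in sizes.items() if label[leaf] == 0]

print("mean size of minority-labeled disjuncts:", np.mean(minority))
print("mean size of majority-labeled disjuncts:", np.mean(majority))
```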
Another factor that may exacerbate the issues that already exist with
imbalanced data is noise. While noisy data is a general problem for learning, its
impact is magnified when the data are imbalanced. In fact, we expect noise to
have a greater impact on rare cases than on common cases. To see this, consider
Figure 2.6. Figure 2.6a includes no noisy data, while Figure 2.6b includes a
few noisy examples. In this case, a decision tree classifier is used, which is
configured to require at least two examples at the terminal nodes as a means of
overfitting avoidance. We see that in Figure 2.6b, when one of the two training
examples in the rare positive case is erroneously labeled as belonging to the
negative class, the classifier misses the rare case completely, as two positive
training examples are required to generate a leaf node. The less rare positive
case, however, is not significantly affected because most of the examples in
the induced disjunct are still positive and the two erroneously labeled training
examples are not sufficient to alter the decision boundaries. Thus, noise will
have a more significant impact on the rare cases than on the common cases.
Another way to look at things is that it will be hard to distinguish between rare
cases and noisy data points. Pruning, which is often used to combat noise, will
remove the rare cases and the noisy cases together.
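A rough sense of this effect can be obtained from the following sketch, which mimics the scenario of Figure 2.6 on hypothetical one-dimensional data using scikit-learn (the data values are illustrative, not taken from the figure, and only the mislabeled rare example is reproduced). The min_samples_leaf=2 setting stands in for the rule requiring at least two examples at each terminal node; flipping the label of one of the two rare positive examples typically causes the tree to stop predicting the positive class in that region, while the larger positive region is unaffected.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical 1-D training set echoing Figure 2.6: a rare positive case covered
# by only two examples, a larger common positive region, and negative regions.
X = np.array([[x] for x in [1, 2, 3, 4, 5, 6,        # common negative region
                            10, 11,                  # rare positive case (2 examples)
                            14, 15, 16, 17, 18, 19,  # more negatives
                            25, 26, 27, 28, 29, 30]], dtype=float)
y_clean = np.array([0] * 6 + [1] * 2 + [0] * 6 + [1] * 6)

y_noisy = y_clean.copy()
y_noisy[6] = 0  # noise: one of the two rare positive examples is mislabeled

rare_region = np.array([[10.0], [11.0]])     # where the rare positive case lives
common_region = np.array([[28.0], [29.0]])   # inside the common positive region

for name, y in [("clean", y_clean), ("noisy", y_noisy)]:
    # min_samples_leaf=2 plays the role of "at least two examples per terminal node".
    tree = DecisionTreeClassifier(min_samples_leaf=2, random_state=0).fit(X, y)
    print(name,
          "| rare region:", tree.predict(rare_region),
          "| common region:", tree.predict(common_region))
```

With the clean labels the tree can carve out a pure leaf for the two rare positives, but once one of them is mislabeled, no leaf dominated by the positive class can cover that region, so the rare case is lost while the common positive region is still predicted correctly.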
It is worth noting that while this section highlights the problem with absolute
rarity, it does not highlight the problem with relative rarity. This is because we
view relative rarity as an issue associated with the algorithm level. The reason
is that class imbalance, which generally focuses on the relative differences in
class proportions, is not fundamentally a problem at the data level — it is simply
a property of the data distribution. We maintain that the problems associated
with class imbalance and relative rarity are due to the lack of a proper problem
formulation (with accurate evaluation criteria) or to algorithmic limitations of
existing learning methods. The key point is that relative rarity/class imbalance is
fundamentally an algorithm-level issue rather than a data-level one.