Figure 2.3
Relationship between rare/common cases and small/large disjuncts. (Figure not reproduced: it shows a rare case fit by a small disjunct and a common case fit by a large disjunct, with training examples plotted as “+” symbols.)
the data points outside these boundaries should be labeled as negative. The
training examples are shown using the plus (“+”) and the minus (“−”) symbols.
Note that the classifier will have misclassification errors on future test examples,
as the boundaries for the rare and the common cases do not match the decision
boundaries, represented by the dashed rectangles, which are formed by the
classifier. Because approximately 50% of the decision boundary for the small
disjunct falls outside the rare case, we expect this small disjunct to have an error
rate near 50%. Applying similar reasoning, the error rate of the large disjunct in
this case will only be about 10%. Because the uncertainty in this noise-free case
mainly manifests itself near the decision boundaries, in such cases, we generally
expect the small disjuncts to have a higher error rate, as a higher proportion of
their “area” lies near the decision boundary of the case to be learned. The difference
between the induced decision boundaries and the actual decision boundaries in
this case is mainly due to an insufficient number of training examples, although
the bias of the learner also plays a role. In real-world situations, other factors,
such as noise, will also have an effect.
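The geometric argument above can be checked numerically. Below is a minimal sketch, assuming axis-aligned rectangular boundaries like those in Figure 2.3 (all coordinates are invented for illustration), that estimates a disjunct's error rate as the fraction of its learned rectangle lying outside the true case boundary:

```python
import random

random.seed(0)

def error_fraction(learned, true, n=100_000):
    """Monte Carlo estimate of the fraction of the learned rectangle's
    area that falls outside the true case boundary -- i.e., the expected
    error rate on test points drawn uniformly from the learned disjunct."""
    lx0, ly0, lx1, ly1 = learned
    tx0, ty0, tx1, ty1 = true
    outside = 0
    for _ in range(n):
        x = random.uniform(lx0, lx1)
        y = random.uniform(ly0, ly1)
        if not (tx0 <= x <= tx1 and ty0 <= y <= ty1):
            outside += 1
    return outside / n

# Hypothetical boundaries: half of the small disjunct's learned rectangle
# lies outside the rare case, while the large disjunct fits the common
# case closely (coordinates invented for illustration).
small_err = error_fraction(learned=(0, 0, 2, 2), true=(1, 0, 3, 2))
large_err = error_fraction(learned=(0, 0, 10, 10), true=(0.5, 0.5, 10.5, 10.5))
print(f"small disjunct error: {small_err:.2f}")  # roughly 0.5
print(f"large disjunct error: {large_err:.2f}")  # roughly 0.1
```

The estimates track the figure's intuition: the error rate is driven by how much of the induced boundary misses the true case, and a small disjunct has proportionally more of its area in that mismatched region.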
The pattern of small disjuncts having much higher error rates than large
disjuncts, suggested by Figure 2.3, has been observed in practice in numerous
studies [7–13]. This pattern is shown in Figure 2.4 for the classifier induced by
C4.5 from the move dataset [13]. Pruning was disabled in this case as pruning
has been shown to obscure the effect of small disjuncts on learning [12]. The
disjunct size, specified on the x-axis, is determined by the number of training
examples correctly classified by the disjunct (i.e., leaf node). The impact of the
error-prone small disjuncts on learning is actually much greater than suggested
by Figure 2.4, as the disjuncts of size 0–3, which correspond to the left-most
bar in the figure, cover about 50% of the total examples and 70% of the errors.
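This per-disjunct bookkeeping can be sketched in a few lines. The rule set and data below are hypothetical stand-ins for a learned classifier's leaves, not the move dataset from the cited study:

```python
def disjunct_stats(disjuncts, examples):
    """For each disjunct (name, covers, predicted_label), report its size
    (number of examples it covers and classifies correctly) and its
    error rate over the examples it covers."""
    stats = {}
    for name, covers, predicted in disjuncts:
        covered = [label for value, label in examples if covers(value)]
        correct = sum(1 for label in covered if label == predicted)
        stats[name] = {
            "size": correct,
            "error_rate": 1 - correct / len(covered) if covered else 0.0,
        }
    return stats

# Hypothetical one-feature rule set: one large disjunct for the common
# (negative) class, one small disjunct for the rare (positive) class.
disjuncts = [
    ("large", lambda x: x < 38, "neg"),
    ("small", lambda x: x >= 38, "pos"),
]
examples = [(x, "neg") for x in range(40)] + [(x, "pos") for x in range(40, 46)]

stats = disjunct_stats(disjuncts, examples)
print(stats["large"])  # size 38, error_rate 0.0
print(stats["small"])  # size 6, error_rate 0.25
```

Even in this toy setting the pattern of Figure 2.4 appears: the small disjunct's covered region extends past the rare class's true extent, so a larger share of the examples it claims are misclassified.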
In summary, we see that both rare classes and rare cases are difficult to learn
and both lead to difficulties when learning from imbalanced data. When we
discuss the foundational issues associated with learning from imbalanced data,
we will see that these two difficulties are connected, in that rare classes are
disproportionately made up of rare cases.