Figure 2.3
Relationship between rare/common cases and small/large disjuncts. (Figure not reproduced: it shows a rare case fit by a small disjunct and a common case fit by a large disjunct, with training examples plotted as “+” symbols.)
the data points outside these boundaries should be labeled as negative. The
training examples are shown using the plus (“+”) and the minus (“−”) symbols.
Note that the classifier will have misclassification errors on future test examples,
as the boundaries for the rare and the common cases do not match the decision
boundaries, represented by the dashed rectangles, which are formed by the
classifier. Because approximately 50% of the decision boundary for the small
disjunct falls outside the rare case, we expect this small disjunct to have an error
rate near 50%. Applying similar reasoning, the error rate of the large disjunct in
this case will only be about 10%. Because the uncertainty in this noise-free case
mainly manifests itself near the decision boundaries, in such cases, we generally
expect the small disjuncts to have a higher error rate, as a higher proportion of
their “area” lies near the decision boundary of the case to be learned. The difference
between the induced decision boundaries and the actual decision boundaries in
this case is mainly due to an insufficient number of training examples, although
the bias of the learner also plays a role. In real-world situations, other factors,
such as noise, will also have an effect.
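The geometric argument above can be checked numerically. Below is a minimal sketch, assuming axis-aligned rectangular boundaries like those in Figure 2.3 (all coordinates are invented for illustration), that estimates a disjunct's error rate as the fraction of its learned rectangle lying outside the true case boundary:

```python
import random

random.seed(0)

def error_fraction(learned, true, n=100_000):
    """Monte Carlo estimate of the fraction of the learned rectangle's
    area that falls outside the true case boundary -- i.e., the expected
    error rate on test points drawn uniformly from the learned disjunct."""
    lx0, ly0, lx1, ly1 = learned
    tx0, ty0, tx1, ty1 = true
    outside = 0
    for _ in range(n):
        x = random.uniform(lx0, lx1)
        y = random.uniform(ly0, ly1)
        if not (tx0 <= x <= tx1 and ty0 <= y <= ty1):
            outside += 1
    return outside / n

# Hypothetical boundaries: half of the small disjunct's learned rectangle
# lies outside the rare case, while the large disjunct fits the common
# case closely (coordinates invented for illustration).
small_err = error_fraction(learned=(0, 0, 2, 2), true=(1, 0, 3, 2))
large_err = error_fraction(learned=(0, 0, 10, 10), true=(0.5, 0.5, 10.5, 10.5))
print(f"small disjunct error: {small_err:.2f}")  # roughly 0.5
print(f"large disjunct error: {large_err:.2f}")  # roughly 0.1
```

The estimates track the figure's intuition: the error rate is driven by how much of the induced boundary misses the true case, and a small disjunct has proportionally more of its area in that mismatched region.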
The pattern of small disjuncts having much higher error rates than large
disjuncts, suggested by Figure 2.3, has been observed in practice in numerous
studies [7–13]. This pattern is shown in Figure 2.4 for the classifier induced by
C4.5 from the move dataset [13]. Pruning was disabled in this case as pruning
has been shown to obscure the effect of small disjuncts on learning [12]. The
disjunct size, specified on the x-axis, is determined by the number of training
examples correctly classified by the disjunct (i.e., leaf node). The impact of the
error-prone small disjuncts on learning is actually much greater than suggested
by Figure 2.4, as the disjuncts of size 0–3, which correspond to the left-most
bar in the figure, cover about 50% of the total examples and 70% of the errors.
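This per-disjunct bookkeeping can be sketched in a few lines. The rule set and data below are hypothetical stand-ins for a learned classifier's leaves, not the move dataset from the cited study:

```python
def disjunct_stats(disjuncts, examples):
    """For each disjunct (name, covers, predicted_label), report its size
    (number of examples it covers and classifies correctly) and its
    error rate over the examples it covers."""
    stats = {}
    for name, covers, predicted in disjuncts:
        covered = [label for value, label in examples if covers(value)]
        correct = sum(1 for label in covered if label == predicted)
        stats[name] = {
            "size": correct,
            "error_rate": 1 - correct / len(covered) if covered else 0.0,
        }
    return stats

# Hypothetical one-feature rule set: one large disjunct for the common
# (negative) class, one small disjunct for the rare (positive) class.
disjuncts = [
    ("large", lambda x: x < 38, "neg"),
    ("small", lambda x: x >= 38, "pos"),
]
examples = [(x, "neg") for x in range(40)] + [(x, "pos") for x in range(40, 46)]

stats = disjunct_stats(disjuncts, examples)
print(stats["large"])  # size 38, error_rate 0.0
print(stats["small"])  # size 6, error_rate 0.25
```

Even in this toy setting the pattern of Figure 2.4 appears: the small disjunct's covered region extends past the rare class's true extent, so a larger share of the examples it claims are misclassified.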
In summary, we see that both rare classes and rare cases are difficult to learn
and both lead to difficulties when learning from imbalanced data. When we
discuss the foundational issues associated with learning from imbalanced data,
we will see that these two difficulties are connected, in that rare classes are
disproportionately made up of rare cases.