These results suggest that absolute rarity poses a very serious problem for
learning. But the problem could also be that small disjuncts sometimes do not
represent rare, or exceptional, cases, but instead represent noise. The underlying
problem, then, is that there is no easy way to distinguish between those small
disjuncts that represent rare/exceptional cases, which should be kept, and those
that represent noise, which should be discarded (i.e., pruned).
We have seen that rare cases are difficult to learn because of a lack of training
examples. It is generally assumed that rare classes are difficult to learn for similar
reasons, although, in theory, rare classes might not be disproportionately made up
of rare cases when compared to common classes. One study showed that this is most
likely not the case: across 26 datasets, the disjuncts labeled with the minority class
were much smaller than those labeled with the majority class [4]. Thus, rare classes
do tend to be made up of more rare cases (on the assumption that rare cases form
small disjuncts), and because rare cases are harder to learn than common cases, the
minority class will tend to be harder to learn than the majority class. This effect is
ultimately due to an absolute lack of training examples for the minority class.
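This relationship can be illustrated empirically. The sketch below is a minimal example, assuming scikit-learn and a synthetic imbalanced dataset (not one of the 26 datasets studied in [4]): it grows an unpruned decision tree and compares the sizes of the leaves (disjuncts) that predict the minority class with those that predict the majority class. On imbalanced data, the minority-labeled leaves will typically be much smaller.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced dataset (class 1 is the minority, roughly 5% of examples).
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.95, 0.05], random_state=0)

# Unpruned tree: each leaf corresponds to one disjunct of the learned concept.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

leaf_ids = clf.apply(X)            # leaf reached by each training example
sizes = {}                         # leaf id -> number of training examples it covers
for leaf in leaf_ids:
    sizes[leaf] = sizes.get(leaf, 0) + 1

# Class predicted by each leaf (majority class of the examples it covers).
label = {leaf: int(clf.tree_.value[leaf].argmax()) for leaf in sizes}

minority = [n for leaf, n in sizes.items() if label[leaf] == 1]
majority = [n for leaf, n in sizes.items() if label[leaf] == 0]

print("mean size of minority-labeled disjuncts:", np.mean(minority))
print("mean size of majority-labeled disjuncts:", np.mean(majority))
```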
Another factor that may exacerbate the issues that already exist with
imbalanced data is noise. While noisy data is a general problem for learning, its
impact is magnified when the data are imbalanced. In fact, we expect noise to
have a greater impact on rare cases than on common cases. To see this, consider
Figure 2.6. Figure 2.6a includes no noisy data, while Figure 2.6b includes a
few noisy examples. In this case, a decision tree classifier is used, which is
configured to require at least two examples at the terminal nodes as a means of
overfitting avoidance. We see that in Figure 2.6b, when one of the two training
examples in the rare positive case is erroneously labeled as belonging to the
negative class, the classifier misses the rare case completely, as two positive
training examples are required to generate a leaf node. The less rare positive
case, however, is not significantly affected because most of the examples in
the induced disjunct are still positive and the two erroneously labeled training
examples are not sufficient to alter the decision boundaries. Thus, noise will
have a more significant impact on the rare cases than on the common cases.
Another way to look at things is that it will be hard to distinguish between rare
cases and noisy data points. Pruning, which is often used to combat noise, will
remove the rare cases and the noisy cases together.
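A rough sense of this effect can be obtained from the following sketch, which mimics the scenario of Figure 2.6 on hypothetical one-dimensional data using scikit-learn (the data values are illustrative, not taken from the figure, and only the mislabeled rare example is reproduced). The min_samples_leaf=2 setting stands in for the rule requiring at least two examples at each terminal node; flipping the label of one of the two rare positive examples typically causes the tree to stop predicting the positive class in that region, while the larger positive region is unaffected.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical 1-D training set echoing Figure 2.6: a rare positive case covered
# by only two examples, a larger common positive region, and negative regions.
X = np.array([[x] for x in [1, 2, 3, 4, 5, 6,        # common negative region
                            10, 11,                  # rare positive case (2 examples)
                            14, 15, 16, 17, 18, 19,  # more negatives
                            25, 26, 27, 28, 29, 30]], dtype=float)
y_clean = np.array([0] * 6 + [1] * 2 + [0] * 6 + [1] * 6)

y_noisy = y_clean.copy()
y_noisy[6] = 0  # noise: one of the two rare positive examples is mislabeled

rare_region = np.array([[10.0], [11.0]])     # where the rare positive case lives
common_region = np.array([[28.0], [29.0]])   # inside the common positive region

for name, y in [("clean", y_clean), ("noisy", y_noisy)]:
    # min_samples_leaf=2 plays the role of "at least two examples per terminal node".
    tree = DecisionTreeClassifier(min_samples_leaf=2, random_state=0).fit(X, y)
    print(name,
          "| rare region:", tree.predict(rare_region),
          "| common region:", tree.predict(common_region))
```

With the clean labels the tree can carve out a pure leaf for the two rare positives, but once one of them is mislabeled, no leaf dominated by the positive class can cover that region, so the rare case is lost while the common positive region is still predicted correctly.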
It is worth noting that while this section highlights the problem with absolute
rarity, it does not highlight the problem with relative rarity. This is because we
view relative rarity as an issue associated with the algorithm level. The reason
is that class imbalance, which generally focuses on the relative differences in
class proportions, is not fundamentally a problem at the data level — it is simply
a property of the data distribution. We maintain that the problems associated
with class imbalance and relative rarity are due to the lack of a proper problem
formulation (with accurate evaluation criteria) or to algorithmic limitations of
existing learning methods. The key point is that relative rarity/class imbalance is
fundamentally an algorithm-level issue rather than a data-level one.