Information Technology Reference
In-Depth Information
data. Most learners utilize a bias that encourages generalization and simple mod-
els to avoid the possibility of overfitting the data. But studies have shown that
such biases work well for large disjuncts but not for small disjuncts [8], lead-
ing to the observed problem with small disjuncts (these biases tend to make the
small disjuncts overly general). Inductive bias also plays a role with respect to
rare classes. Many learners prefer the more common classes in the presence of
uncertainty (i.e., they will be biased in favor of the class priors). As a simple
example, imagine a decision tree learner that branches on all possible feature
values when splitting a node in the tree. If one of the resulting branches covers
no training examples, then there is no evidence on which to base a classification.
Most decision tree learners will predict the most frequently occurring class in
this situation, biasing the results against rarer classes.
The algorithm-level issues discussed thus far concern the use of search heuris-
tics and inductive biases that favor the common classes and cases over the rare
classes and cases. But the algorithm-level issues do not just involve favoritism. It
is fundamentally more difficult for an algorithm to identify rare patterns than to
identify relatively common patterns. There may be quite a few instances of the
rare pattern, but the sheer volume of examples belonging to the more common
patterns will obscure the relatively rare patterns. This is perhaps best illustrated
with a variation of a common idiom in English: finding relatively rare patterns
is “like finding needles in a haystack.” The problem in this case is not so much
that there are few needles, but rather that there is so much more hay.
The problem with identifying relatively rare patterns is partly due to the fact
that these patterns are not easily located using the greedy search heuristics that
are in common use. Greedy search heuristics have a problem with relative rarity
because the rare patterns may depend on the conjunction of many conditions,
and therefore examining any single condition in isolation may not provide much
information or guidance. While this may also be true of common objects, with
rare objects the impact is greater because the common objects may obscure the
true signal. As a specific example of this general problem, consider the association
rule mining problem described earlier, where we want to be able to detect the
association between cooking pan and spatula . The problem is that both items are
rarely purchased in a supermarket, so that even if the two are often purchased
together when either one is purchased, this association may not be found. To
find this association, the minimum support threshold for the algorithm would
need to be set quite low. However, if this is done, there will be a combinatorial
explosion because frequently occurring items will be associated with one another
in an enormous number of ways. This association rule mining problem has been
called the rare item problem [14] and it is an analog of the problem of identifying
rare cases in classification problems. The fact that these random co-occurrences
will swamp the meaningful associations between rare items is one example of
the problem with relative rarity.
Another algorithm-level problem is associated with the divide-and-conquer
approach that is used by many classification algorithms, including decision tree
algorithms. Such algorithms repeatedly partition the instance space (and the
Search WWH ::




Custom Search