FOUNDATIONS OF IMBALANCED LEARNING - Imbalanced Learning: Foundations, Algorithms, and Applications

Information Technology Reference

In-Depth Information

data. Most learners utilize a bias that encourages generalization and simple mod-

els to avoid the possibility of overfitting the data. But studies have shown that

such biases work well for large disjuncts but not for small disjuncts [8], lead-

ing to the observed problem with small disjuncts (these biases tend to make the

small disjuncts overly general). Inductive bias also plays a role with respect to

rare classes. Many learners prefer the more common classes in the presence of

uncertainty (i.e., they will be biased in favor of the class priors). As a simple

example, imagine a decision tree learner that branches on all possible feature

values when splitting a node in the tree. If one of the resulting branches covers

no training examples, then there is no evidence on which to base a classification.

Most decision tree learners will predict the most frequently occurring class in

this situation, biasing the results against rarer classes.

The algorithm-level issues discussed thus far concern the use of search heuris-

tics and inductive biases that favor the common classes and cases over the rare

classes and cases. But the algorithm-level issues do not just involve favoritism. It

is fundamentally more difficult for an algorithm to identify rare patterns than to

identify relatively common patterns. There may be quite a few instances of the

rare pattern, but the sheer volume of examples belonging to the more common

patterns will obscure the relatively rare patterns. This is perhaps best illustrated

with a variation of a common idiom in English: finding relatively rare patterns

is “like finding needles in a haystack.” The problem in this case is not so much

that there are few needles, but rather that there is so much more hay.

The problem with identifying relatively rare patterns is partly due to the fact

that these patterns are not easily located using the greedy search heuristics that

are in common use. Greedy search heuristics have a problem with relative rarity

because the rare patterns may depend on the conjunction of many conditions,

and therefore examining any single condition in isolation may not provide much

information or guidance. While this may also be true of common objects, with

rare objects the impact is greater because the common objects may obscure the

true signal. As a specific example of this general problem, consider the association

rule mining problem described earlier, where we want to be able to detect the

association between cooking pan and spatula . The problem is that both items are

rarely purchased in a supermarket, so that even if the two are often purchased

together when either one is purchased, this association may not be found. To

find this association, the minimum support threshold for the algorithm would

need to be set quite low. However, if this is done, there will be a combinatorial

explosion because frequently occurring items will be associated with one another

in an enormous number of ways. This association rule mining problem has been

called the rare item problem [14] and it is an analog of the problem of identifying

rare cases in classification problems. The fact that these random co-occurrences

will swamp the meaningful associations between rare items is one example of

the problem with relative rarity.

Another algorithm-level problem is associated with the divide-and-conquer

approach that is used by many classification algorithms, including decision tree

algorithms. Such algorithms repeatedly partition the instance space (and the

Imbalanced Learning: Foundations, Algorithms, and Applications

Search WWH ::

Custom Search

Home