FOUNDATIONS OF IMBALANCED LEARNING - Imbalanced Learning: Foundations, Algorithms, and Applications

Figure 2.6 The effect of noise on rare cases. (a) No noisy data and (b) few noisy examples.

a problem only because learning algorithms cannot effectively handle such data.

This is a fundamental point, but one that is rarely acknowledged.

2.3.3 Algorithm-Level Issues

There are a variety of algorithm-level issues that impact the ability to learn from

imbalanced data. One such issue is the inability of some algorithms to optimize

learning for the target evaluation criteria. Although this is a general issue with

learning, it affects imbalanced data to a much greater extent than balanced data

because in the imbalanced case, the evaluation criteria typically diverge much

further from the standard evaluation metric, accuracy. In fact, most algorithms

are still designed and tested much more thoroughly for accuracy optimization

than for the optimization of other evaluation metrics. This issue is also shaped by

the metrics used to guide the heuristic search process. For example, decision trees

are generally formed in a top-down manner and the tree construction process

focuses on selecting the best test condition to expand the extremities of the tree.

The quality of the test condition (i.e., the condition used to split the data at the

node) is usually determined by the “purity” of a split, which is often computed as

the weighted average of the purity values of each branch, where the weights are

determined by the fraction of examples that follow that branch. These metrics,

such as information gain, prefer test conditions that result in a balanced tree,

where purity is increased for most of the examples, in contrast to test conditions

that yield high purity for a relatively small subset of the data but low purity

for the rest [15]. The problem is that a single high-purity branch that

covers only a few examples may identify a rare case. Thus, such search heuristics

are biased against identifying highly accurate rare cases, which will also impact

their performance on rare classes (which, as discussed earlier, often comprise

rare cases).
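
The bias described above can be illustrated with a minimal sketch. It assumes entropy as the purity measure and uses invented example counts; the helper names entropy and information_gain, and the two candidate splits, are hypothetical and do not come from the text or from [15]. Information gain is computed here as the parent's entropy minus the weighted average entropy of the branches, with each weight equal to the fraction of examples that follow that branch.

```python
import math


def entropy(pos, neg):
    """Entropy (in bits) of a node holding `pos` positive and `neg` negative examples."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for count in (pos, neg):
        if count:  # a class with zero examples contributes nothing
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(parent, branches):
    """Parent entropy minus the size-weighted average entropy of the branches.

    `parent` and every element of `branches` are (pos, neg) count pairs.
    """
    n = sum(parent)
    weighted = sum((p + q) / n * entropy(p, q) for p, q in branches)
    return entropy(*parent) - weighted


# Hypothetical node: 1000 examples, of which 50 belong to the rare (positive) class.
parent = (50, 950)

# Candidate split 1: one perfectly pure branch covering only 5 positive examples
# (a possible rare case); the remaining 995 examples are barely purer than before.
split_rare_case = [(5, 0), (45, 950)]

# Candidate split 2: a broad 400/600 split that raises purity for the larger
# branch but yields no branch that is mostly positive.
split_broad = [(48, 352), (2, 598)]

gain_rare_case = information_gain(parent, split_rare_case)
gain_broad = information_gain(parent, split_broad)

print(f"gain, rare-case split: {gain_rare_case:.4f}")
print(f"gain, broad split:     {gain_broad:.4f}")
print("preferred:", "broad" if gain_broad > gain_rare_case else "rare-case")
```

On these invented counts the broad split receives the higher gain, even though only the other split produces a branch that identifies the rare class with no errors. The pure branch carries a weight of just 5/1000 in the average, so its contribution is small. The counts were chosen to make the effect visible; other counts can reverse the ordering.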

The bias of a learning algorithm, which is required if the algorithm is to

generalize from the data, can also cause problems when learning from imbalanced
