Figure 2.6 The effect of noise on rare cases. (a) No noisy data and (b) few noisy examples.
a problem only because learning algorithms cannot effectively handle such data.
This is a very fundamental point, but one that is not often acknowledged.
2.3.3 Algorithm-Level Issues
There are a variety of algorithm-level issues that impact the ability to learn from
imbalanced data. One such issue is the inability of some algorithms to optimize
learning for the target evaluation criteria. Although this is a general issue with
learning, it affects imbalanced data to a much greater extent than balanced data
because in the imbalanced case, the evaluation criteria typically diverge much
further from the standard evaluation metric, accuracy. In fact, most algorithms
are still designed and tested much more thoroughly for accuracy optimization
than for the optimization of other evaluation metrics. This issue is also affected by
the metrics used to guide the heuristic search process. For example, decision trees
are generally formed in a top-down manner, and the tree construction process
focuses on selecting the best test condition to expand the extremities of the tree.
The quality of the test condition (i.e., the condition used to split the data at the
node) is usually determined by the “purity” of a split, which is often computed as
the weighted average of the purity values of each branch, where the weights are
determined by the fraction of examples that follow that branch. These metrics,
such as information gain, prefer test conditions that result in a balanced tree,
where purity is increased for most of the examples, in contrast to test conditions
that yield high purity for a relatively small subset of the data but low purity
for the rest [15]. The problem is that a single high-purity branch that
covers only a few examples may identify a rare case. Thus, such search heuristics
are biased against identifying highly accurate rare cases, which will also impact
their performance on rare classes (which, as discussed earlier, often comprise
rare cases).
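
The weighting effect described above can be illustrated with a small calculation. The following is a minimal sketch, not the method of any particular decision tree learner: it assumes entropy as the purity measure, and the class counts, the two candidate splits, and the function names are all hypothetical values chosen for illustration. One split isolates a tiny, perfectly pure branch (a possible rare case); the other divides the data evenly and improves purity modestly on both sides.

```python
import math


def entropy(pos, neg):
    """Binary entropy (in bits) of a node holding `pos` and `neg` examples."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count > 0:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(parent, branches):
    """Parent entropy minus the weighted average entropy of the branches.

    Each branch is weighted by the fraction of examples that follow it.
    """
    n = sum(parent)
    weighted = sum((pos + neg) / n * entropy(pos, neg) for pos, neg in branches)
    return entropy(*parent) - weighted


# Hypothetical imbalanced node: 100 positive and 900 negative examples.
parent = (100, 900)

# Split 1: a pure branch covering only 10 examples (a possible rare case);
# the remaining 990 examples are almost as mixed as before.
rare_case_split = [(10, 0), (90, 900)]

# Split 2: two equal-sized branches, neither pure, both somewhat purer.
balanced_split = [(90, 410), (10, 490)]

gain_rare = information_gain(parent, rare_case_split)
gain_balanced = information_gain(parent, balanced_split)

print(f"gain, rare-case split: {gain_rare:.4f}")
print(f"gain, balanced split:  {gain_balanced:.4f}")
```

Under these assumed counts the balanced split receives the higher gain, even though only the other split produces a branch with no errors. The pure branch carries a weight of just 10/1000, so its perfect purity contributes little to the weighted average.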
The bias of a learning algorithm, which is required if the algorithm is to
generalize from the data, can also cause problems when learning from imbalanced