metrics also play a role at the algorithm level to guide the heuristic search pro-
cess. Some metrics have been developed to improve this search process when
dealing with imbalanced data — most notably metrics based on precision and
recall. Search methods that focus on simultaneously maximizing precision and
recall may fail because of the difficulty of optimizing these competing quantities, so
some systems adopt more sophisticated approaches. Timeweaver [38], a genetic
algorithm-based classification system, periodically modifies the parameter of the
F-measure that controls the relative importance of precision and recall in the
fitness function, so that a diverse set of classification rules is evolved, with some
rules having high precision and others high recall. The expectation is that this will
eventually lead to rules with both high precision and recall. A second approach
optimizes recall in the first phase of the search process and precision in the sec-
ond phase by eliminating false positives covered by the rules [41]. Returning to
the needle and haystack analogy, this approach identifies regions likely to contain
needles in the first phase and then discards strands of hay within these regions
in the second phase.
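The F-measure referred to above is the parameterized F-beta, which reduces to the harmonic mean of precision and recall when beta equals 1 and shifts its emphasis toward recall as beta grows. The sketch below (the precision/recall values and the beta schedule are illustrative, not taken from Timeweaver itself) shows how varying this parameter changes which kind of rule a fitness function rewards.

```python
def f_beta(precision, recall, beta):
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
    # beta < 1 emphasizes precision; beta > 1 emphasizes recall.
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

# Two hypothetical rules evolved by a genetic search (illustrative numbers).
high_precision_rule = (0.90, 0.20)   # (precision, recall)
high_recall_rule = (0.25, 0.85)

# Sweeping beta across generations rewards each kind of rule in turn,
# which keeps both kinds of rules in the population.
for beta in (0.5, 1.0, 2.0):
    fp = f_beta(*high_precision_rule, beta)
    fr = f_beta(*high_recall_rule, beta)
    print(f"beta={beta}: high-precision rule {fp:.2f}, high-recall rule {fr:.2f}")
```

With beta = 0.5 the high-precision rule scores higher, while with beta = 2 the high-recall rule does, so cycling through such values during evolution favors a diverse set of rules rather than a single compromise.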
2.4.3.3 Inductive Biases Better Suited for Imbalanced Data Most inductive
learning systems heavily favor generality over specialization. While an induc-
tive bias that favors generality is appropriate for learning common cases, it
is not appropriate for rare cases and may even cause rare cases to be totally
ignored. There have been several attempts to improve the performance of data-
mining systems with respect to rarity by choosing a more appropriate bias. The
simplest approach involves modifying existing systems to eliminate some small
disjuncts based on tests of statistical significance or on error estimation tech-
niques, often as part of an overfitting avoidance strategy. The hope is that
these will remove only improperly learned disjuncts, but such methods will also
remove those disjuncts formed to cover rare cases. The basic problem is that the
significance of small disjuncts cannot be reliably estimated and consequently sig-
nificant small disjuncts may be eliminated along with the insignificant ones. Error
estimation techniques are also unreliable when there are only a few examples,
and hence they suffer from the same basic problem. These approaches work well
for large disjuncts because in these cases statistical significance and error rate
estimation techniques yield relatively reliable estimates — something they do not
do for small disjuncts.
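As an illustration of why such tests break down, consider a pruning rule that keeps a disjunct only when the fraction of covered examples belonging to the target class is significantly above that class's base rate under a binomial test. The base rate, threshold, and counts below are hypothetical; the sketch simply shows that the same empirical accuracy passes the test for a large disjunct but fails it for a small one.

```python
from math import comb

def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the probability that a disjunct
    covering n examples captures at least k of the target class by chance."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

BASE_RATE = 0.20   # hypothetical prior probability of the class of interest
ALPHA = 0.05       # hypothetical significance threshold for pruning

# A large disjunct: 20 of its 30 covered examples belong to the class.
# The tail probability is far below ALPHA, so the disjunct is kept.
print(upper_tail(20, 30, BASE_RATE))

# A small disjunct with the same empirical accuracy: 2 of its 3 covered
# examples belong to the class. The tail probability is about 0.10, above
# ALPHA, so the disjunct is pruned even if it covers a genuine rare case.
print(upper_tail(2, 3, BASE_RATE))
```

Because the small disjunct covers only three examples, the test has almost no power, so the pruning step cannot distinguish a genuine rare case from noise and discards it.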
More sophisticated approaches have been developed, but the impact of these
strategies on rare cases cannot be measured directly, as the rare cases in the true
concept are generally not known. Furthermore, in early work on this topic, the
focus was on the performance of small disjuncts, so it is difficult to assess the
impact of these strategies on class imbalance. In one study, the learner's maxi-
mum generality bias was replaced with a maximum specificity bias for the small
disjuncts, which improved the performance of the small disjuncts but degraded
the performance of the larger disjuncts and the overall accuracy [8]. Another
study also utilized a maximum specificity bias but took steps to ensure that this