metrics also play a role at the algorithm level to guide the heuristic search pro-
cess. Some metrics have been developed to improve this search process when
dealing with imbalanced data — most notably metrics based on precision and
recall. Search methods that focus on simultaneously maximizing precision and
recall may fail because of the difficulty of optimizing these competing quantities, so
some systems adopt more sophisticated approaches. Timeweaver [38], a genetic
algorithm-based classification system, periodically modifies the parameter of the
F-measure that controls the relative importance of precision and recall in the
fitness function, so that a diverse set of classification rules is evolved, with some
rules having high precision and others high recall. The expectation is that this will
eventually lead to rules with both high precision and recall. A second approach
optimizes recall in the first phase of the search process and precision in the sec-
ond phase by eliminating false positives covered by the rules [41]. Returning to
the needle and haystack analogy, this approach identifies regions likely to contain
needles in the first phase and then discards strands of hay within these regions
in the second phase.
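The F-measure referred to above is the parameterized F-beta, which reduces to the harmonic mean of precision and recall when beta equals 1 and shifts its emphasis toward recall as beta grows. The sketch below (the precision/recall values and the beta schedule are illustrative, not taken from Timeweaver itself) shows how varying this parameter changes which kind of rule a fitness function rewards.

```python
def f_beta(precision, recall, beta):
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
    # beta < 1 emphasizes precision; beta > 1 emphasizes recall.
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

# Two hypothetical rules evolved by a genetic search (illustrative numbers).
high_precision_rule = (0.90, 0.20)   # (precision, recall)
high_recall_rule = (0.25, 0.85)

# Sweeping beta across generations rewards each kind of rule in turn,
# which keeps both kinds of rules in the population.
for beta in (0.5, 1.0, 2.0):
    fp = f_beta(*high_precision_rule, beta)
    fr = f_beta(*high_recall_rule, beta)
    print(f"beta={beta}: high-precision rule {fp:.2f}, high-recall rule {fr:.2f}")
```

With beta = 0.5 the high-precision rule scores higher, while with beta = 2 the high-recall rule does, so cycling through such values during evolution favors a diverse set of rules rather than a single compromise.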
2.4.3.3 Inductive Biases Better Suited for Imbalanced Data Most inductive
learning systems heavily favor generality over specialization. While an induc-
tive bias that favors generality is appropriate for learning common cases, it
is not appropriate for rare cases and may even cause rare cases to be totally
ignored. There have been several attempts to improve the performance of data-
mining systems with respect to rarity by choosing a more appropriate bias. The
simplest approach involves modifying existing systems to eliminate some small
disjuncts based on tests of statistical significance or on error estimation tech-
niques, often as part of an overfitting avoidance strategy. The hope is that
these will remove only improperly learned disjuncts, but such methods will also
remove those disjuncts formed to cover rare cases. The basic problem is that the
significance of small disjuncts cannot be reliably estimated and consequently sig-
nificant small disjuncts may be eliminated along with the insignificant ones. Error
estimation techniques are also unreliable when there are only a few examples,
and hence they suffer from the same basic problem. These approaches work well
for large disjuncts because in these cases statistical significance and error rate
estimation techniques yield relatively reliable estimates — something they do not
do for small disjuncts.
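As an illustration of why such tests break down, consider a pruning rule that keeps a disjunct only when the fraction of covered examples belonging to the target class is significantly above that class's base rate under a binomial test. The base rate, threshold, and counts below are hypothetical; the sketch simply shows that the same empirical accuracy passes the test for a large disjunct but fails it for a small one.

```python
from math import comb

def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the probability that a disjunct
    covering n examples captures at least k of the target class by chance."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

BASE_RATE = 0.20   # hypothetical prior probability of the class of interest
ALPHA = 0.05       # hypothetical significance threshold for pruning

# A large disjunct: 20 of its 30 covered examples belong to the class.
# The tail probability is far below ALPHA, so the disjunct is kept.
print(upper_tail(20, 30, BASE_RATE))

# A small disjunct with the same empirical accuracy: 2 of its 3 covered
# examples belong to the class. The tail probability is about 0.10, above
# ALPHA, so the disjunct is pruned even if it covers a genuine rare case.
print(upper_tail(2, 3, BASE_RATE))
```

Because the small disjunct covers only three examples, the test has almost no power, so the pruning step cannot distinguish a genuine rare case from noise and discards it.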
More sophisticated approaches have been developed, but the impact of these
strategies on rare cases cannot be measured directly, as the rare cases in the true
concept are generally not known. Furthermore, in early work on this topic, the
focus was on the performance of small disjuncts, so it is difficult to assess the
impact of these strategies on class imbalance. In one study, the learner's maxi-
mum generality bias was replaced with a maximum specificity bias for the small
disjuncts, which improved the performance of the small disjuncts but degraded
the performance of the larger disjuncts and the overall accuracy [8]. Another
study also utilized a maximum specificity bias but took steps to ensure that this