Information Technology Reference
In-Depth Information
concepts that contain within-class imbalances (i.e., rare cases) [5]. Only when
the within-class imbalances are also eliminated, the concept can be learned well.
This suggests that sampling should be used to improve the performance associ-
ated with rare cases. Unfortunately, there are problems with implementing the
strategy for real-world domains where one cannot identify the rare cases. The
closest we can get to this approach is to assume that rare cases correspond to
small disjuncts in the induced classifier and then sample based on disjunct size,
with the goal of equalizing the sizes of the disjuncts in the induced classifier.
2.4.3 Algorithm-Level Methods
A number of algorithm-level methods have been developed to handle imbalanced
data. The majority of these techniques involve using search methods that are well
suited for identifying rare patterns in data when common patterns abound.
2.4.3.1 Search Methods that Avoid Greed and Recursive Partitioning Greedy
search methods and search methods that use a divide-and-conquer approach to
recursively partition the search space have difficulty in finding rare patterns, for
the reasons provided in Section 2.3.3. Thus, learning methods that avoid, or min-
imize, these two approaches will tend to perform better when there is imbalanced
data. The advances in computational power that have occurred since many of the
basic learning methods were introduced make it more practical to utilize less
greedy search heuristics. Perhaps the best example of this is genetic algorithms,
which are global search techniques that work with populations of candidate solu-
tions rather than a single solution and employ stochastic operators to guide the
search process [36]. These methods tend to be far less greedy than many popu-
lar learning methods and these characteristics permit genetic algorithms to cope
well with attribute interactions [36, 37] and avoid getting stuck in local maxima,
which together make genetic algorithms very suitable for dealing with rarity. In
addition, genetic algorithms also do not rely on a divide-and- conquer approach
that leads to the data fragmentation problem. Several systems have relied on the
power of genetic algorithms to handle rarity. Weiss [38] uses a genetic algorithm
to predict very rare events, while Carvalho and Freitas [39, 40] use a genetic
algorithm to discover “small disjunct rules.” Certainly other search methods are
less greedy than decision trees and also do not suffer from the data fragmen-
tation problems. However, no truly comprehensive study has examined a large
variety of different search methods over a large variety of imbalanced datasets;
so definitive conclusions cannot be drawn. Such studies would be useful and it is
interesting to note that these types of large-scale empirical studies have been con-
ducted to compare the effectiveness of sampling methods — which have garnered
much more focused attention from the imbalanced data community.
2.4.3.2 Search Methods that Use Metrics Designed to Handle Imbalanced Data
One problem-level method for handling class imbalance involves using evaluation
metrics that properly value the learned/mined knowledge. However, evaluation
Search WWH ::




Custom Search