FOUNDATIONS OF IMBALANCED LEARNING - Imbalanced Learning: Foundations, Algorithms, and Applications

Information Technology Reference

In-Depth Information

concepts that contain within-class imbalances (i.e., rare cases) [5]. Only when

the within-class imbalances are also eliminated, the concept can be learned well.

This suggests that sampling should be used to improve the performance associ-

ated with rare cases. Unfortunately, there are problems with implementing the

strategy for real-world domains where one cannot identify the rare cases. The

closest we can get to this approach is to assume that rare cases correspond to

small disjuncts in the induced classifier and then sample based on disjunct size,

with the goal of equalizing the sizes of the disjuncts in the induced classifier.

2.4.3 Algorithm-Level Methods

A number of algorithm-level methods have been developed to handle imbalanced

data. The majority of these techniques involve using search methods that are well

suited for identifying rare patterns in data when common patterns abound.

2.4.3.1 Search Methods that Avoid Greed and Recursive Partitioning Greedy

search methods and search methods that use a divide-and-conquer approach to

recursively partition the search space have difficulty in finding rare patterns, for

the reasons provided in Section 2.3.3. Thus, learning methods that avoid, or min-

imize, these two approaches will tend to perform better when there is imbalanced

data. The advances in computational power that have occurred since many of the

basic learning methods were introduced make it more practical to utilize less

greedy search heuristics. Perhaps the best example of this is genetic algorithms,

which are global search techniques that work with populations of candidate solu-

tions rather than a single solution and employ stochastic operators to guide the

search process [36]. These methods tend to be far less greedy than many popu-

lar learning methods and these characteristics permit genetic algorithms to cope

well with attribute interactions [36, 37] and avoid getting stuck in local maxima,

which together make genetic algorithms very suitable for dealing with rarity. In

addition, genetic algorithms also do not rely on a divide-and- conquer approach

that leads to the data fragmentation problem. Several systems have relied on the

power of genetic algorithms to handle rarity. Weiss [38] uses a genetic algorithm

to predict very rare events, while Carvalho and Freitas [39, 40] use a genetic

algorithm to discover “small disjunct rules.” Certainly other search methods are

less greedy than decision trees and also do not suffer from the data fragmen-

tation problems. However, no truly comprehensive study has examined a large

variety of different search methods over a large variety of imbalanced datasets;

so definitive conclusions cannot be drawn. Such studies would be useful and it is

interesting to note that these types of large-scale empirical studies have been con-

ducted to compare the effectiveness of sampling methods — which have garnered

much more focused attention from the imbalanced data community.

2.4.3.2 Search Methods that Use Metrics Designed to Handle Imbalanced Data

One problem-level method for handling class imbalance involves using evaluation

metrics that properly value the learned/mined knowledge. However, evaluation

Search WWH ::

Custom Search

Home