FOUNDATIONS OF IMBALANCED LEARNING - Imbalanced Learning: Foundations, Algorithms, and Applications


boosting may suffer from the same problems as oversampling (e.g., overfitting),
as boosting will tend to weight examples belonging to the rare classes more
than those belonging to the common classes, effectively duplicating some of
the examples belonging to the rare classes. Instead of changing the distribution
of training data by updating the weights associated with each example,
SMOTEBoost alters the distribution by adding new minority class examples
using the SMOTE algorithm [33].
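
To make the idea of "adding new minority class examples" concrete, the following is a minimal sketch of SMOTE-style interpolation only; it does not include the boosting loop of SMOTEBoost. The function name smote_samples, its parameters, and the toy data are assumptions made for illustration and do not come from [33].

```python
import math
import random


def smote_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by SMOTE-style interpolation.

    Hypothetical helper: each synthetic point lies on the line segment
    between a minority example and one of its k nearest minority neighbours.
    Requires at least two minority examples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Invented toy minority class with two numeric features.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_samples(minority, n_new=5)
```

Because every synthetic point is an interpolation between two real minority examples, it is a new example rather than an exact duplicate, which is the property the text contrasts with reweighting.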

2.4.3.5 Learn Only the Rare Class

The problem of relative rarity often causes the rare classes to be ignored by
classifiers. One method of addressing this data-level problem is to employ an
algorithm that only learns classification rules for the rare class, as this will
prevent the more common classes from overwhelming the rarer classes. There are
two main variations to this approach. The recognition-based approach learns only
from examples associated with the rare class, thus recognizing the patterns
shared by the training examples, rather than discriminating between examples
belonging to different classes. Several systems have used such recognition-based
methods to learn rare classes [47, 48].
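
As a rough illustration of the recognition-based idea, the sketch below fits a model on rare-class examples only and then asks whether a new point resembles them. It is a toy centroid-and-radius recognizer, not the systems cited in [47, 48]; the class name CentroidRecognizer, the slack parameter, and the data are all assumptions.

```python
import math


class CentroidRecognizer:
    """Toy recognition-based learner: it never sees common-class examples."""

    def __init__(self, slack=1.1):
        self.slack = slack  # hypothetical margin multiplier on the radius
        self.centroid = None
        self.radius = None

    def fit(self, rare_examples):
        dims = len(rare_examples[0])
        n = len(rare_examples)
        # Mean of the rare-class examples in each dimension.
        self.centroid = tuple(
            sum(x[d] for x in rare_examples) / n for d in range(dims)
        )
        # Radius: distance to the farthest training example, plus some slack.
        self.radius = self.slack * max(
            math.dist(self.centroid, x) for x in rare_examples
        )
        return self

    def is_rare(self, x):
        """Recognize x as rare if it falls inside the learned region."""
        return math.dist(self.centroid, x) <= self.radius


# Invented rare-class training data; no common-class data is used.
rare = [(5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (5.1, 5.2)]
recognizer = CentroidRecognizer().fit(rare)
near = recognizer.is_rare((5.0, 5.1))
far = recognizer.is_rare((0.0, 0.0))
```

The design choice to illustrate is that the decision boundary is derived from the shared pattern of the rare examples alone, so the common classes cannot overwhelm it.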

The other approach, which is more common and supported by several learning
algorithms, learns from examples belonging to all classes but first learns rules
to cover the rare classes [15, 49, 50]. Note that this approach avoids most of
the problems with data fragmentation, as examples belonging to the rare classes
will not be allocated to the rules associated with the common classes before any
rules are formed that cover the rare classes. Such methods are also free to focus
only on the performance of the rules associated with the rare class and not worry
about how this affects the overall performance of the classifier [15, 50]. Probably
the most popular such algorithm is the Ripper algorithm [49], which builds rules
using a separate-and-conquer approach. Ripper normally generates rules for each
class from the rarest class to the most common class. At each stage, it grows rules
for the one targeted class by adding conditions until the rule covers no examples
belonging to the other classes. This leads to highly specialized rules, which
are good for covering rare cases. Ripper then covers the most common class
using a default rule that is used when no other rule is applicable.
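
The rarest-first, separate-and-conquer strategy described above can be sketched roughly as follows. This is a heavily simplified illustration, not Ripper itself: it omits Ripper's pruning, stopping criteria, and rule optimization, and it swaps in plain precision as the condition-selection heuristic. All function names and the toy data are assumptions.

```python
from collections import Counter


def covers(rule, features):
    """A rule is a dict of attribute -> required value."""
    return all(features[a] == v for a, v in rule.items())


def grow_rule(examples, target):
    """Greedily add conditions until no covered example has another class."""
    rule = {}
    covered = list(examples)
    while any(label != target for _, label in covered):
        best = None
        # Candidate conditions come from the covered target-class examples.
        for features, label in covered:
            if label != target:
                continue
            for attr, value in features.items():
                if attr in rule:
                    continue
                subset = [(f, l) for f, l in covered if f[attr] == value]
                pos = sum(1 for _, l in subset if l == target)
                score = (pos / len(subset), pos)  # precision, then coverage
                if best is None or score > best[0]:
                    best = (score, attr, value, subset)
        if best is None:  # every attribute is already used; give up
            break
        _, attr, value, covered = best
        rule[attr] = value
    return rule


def learn_rules(examples):
    """Return an ordered rule list and the default (most common) class."""
    counts = Counter(label for _, label in examples)
    order = sorted(counts, key=lambda c: counts[c])  # rarest class first
    default = order[-1]
    remaining = list(examples)
    rules = []
    for target in order[:-1]:
        while any(label == target for _, label in remaining):
            rule = grow_rule(remaining, target)
            rules.append((rule, target))
            # "Separate": drop every example the new rule covers.
            remaining = [(f, l) for f, l in remaining if not covers(rule, f)]
    return rules, default


def predict(rules, default, features):
    """Apply the first matching rule, else fall back to the default rule."""
    for rule, label in rules:
        if covers(rule, features):
            return label
    return default


# Invented toy data: "attack" is the rare class.
data = [
    ({"protocol": "tcp", "flag": "syn"}, "attack"),
    ({"protocol": "tcp", "flag": "ack"}, "normal"),
    ({"protocol": "udp", "flag": "none"}, "normal"),
    ({"protocol": "tcp", "flag": "ack"}, "normal"),
    ({"protocol": "udp", "flag": "none"}, "normal"),
    ({"protocol": "tcp", "flag": "syn"}, "attack"),
    ({"protocol": "icmp", "flag": "none"}, "normal"),
]
rules, default = learn_rules(data)
```

On this toy data the rare class receives its own rule before any rule for the common class exists, and the common class is handled only by the default rule, which mirrors the ordering the text attributes to Ripper.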

2.4.3.6 Algorithms for Mining Rare Items

Association rule mining is a well-understood area. However, when metrics other
than support and confidence are used to identify item sets or their association
rules, algorithmic changes are required. In Section 2.4.1, we briefly discussed
a variety of metrics that can supplement support and confidence when finding
association rules. We did not describe the corresponding changes to the
association rule mining algorithms, but they are described in detail in the
relevant papers [21-26].

There is also an algorithmic solution to the rare item problem, in which
significant associations between rarely occurring items may be missed because the
minimum support value minsup cannot be set too low, as a very low value would
cause a combinatorial explosion of associations. This problem can be solved by
specifying multiple minimum levels of support to reflect the frequencies of the
