boosting may suffer from the same problems as oversampling (e.g., overfitting),
as boosting will tend to weight examples belonging to the rare classes more
than those belonging to the common classes — effectively duplicating some of
the examples belonging to the rare classes. Instead of changing the distribution
of training data by updating the weights associated with each example,
SMOTEBoost alters the distribution by adding new minority class examples
using the SMOTE algorithm [33].
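The kind of synthetic example generation that SMOTE performs can be illustrated with a minimal sketch. It assumes numeric feature vectors and Euclidean distance; the function name smote_sketch, the parameters n_new, k, and seed, and the toy data are hypothetical and are not taken from [33]. The sketch covers only the interpolation step, not the boosting loop of SMOTEBoost.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by interpolating toward neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority examples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # The k nearest minority-class neighbours of the base point (itself excluded).
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        # The new point lies on the line segment from base toward the neighbour.
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class examples with two numeric features.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=5)
```

Because each synthetic point is placed between two existing minority examples rather than copied from one, the minority region is filled in instead of being duplicated.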
2.4.3.5 Learn Only the Rare Class
The problem of relative rarity often causes the rare classes to be ignored by
classifiers. One method of addressing this data-level problem is to employ an
algorithm that learns classification rules only for the rare class, as this
prevents the more common classes from overwhelming the rarer classes. There are
two main variations of this approach. The recognition-based approach learns only
from examples associated with the rare class, thus recognizing the patterns
shared by the training examples rather than discriminating between examples
belonging to different classes. Several systems have used such recognition-based
methods to learn rare classes [47, 48].
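As a rough illustration of the recognition-based idea, the following sketch builds a model from rare-class examples alone and then accepts any point that resembles them. It is an assumption made for illustration only: the centroid-and-radius model, the names fit_recognizer and is_rare, and the slack parameter are hypothetical and do not describe the systems cited in [47, 48].

```python
import math


def fit_recognizer(rare_examples, slack=1.5):
    """Summarise the rare class by a centroid and a distance threshold."""
    dims = len(rare_examples[0])
    n = len(rare_examples)
    centroid = tuple(sum(p[d] for p in rare_examples) / n for d in range(dims))
    # Threshold: distance of the farthest training example, widened by `slack`.
    radius = slack * max(math.dist(centroid, p) for p in rare_examples)
    return centroid, radius


def is_rare(model, point):
    """Recognise a point as rare-class if it falls inside the learned region."""
    centroid, radius = model
    return math.dist(centroid, point) <= radius


# Only rare-class examples are used; no common-class data is ever seen.
rare = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]
model = fit_recognizer(rare)
```

With this toy data, is_rare(model, (1.1, 1.0)) holds while is_rare(model, (5.0, 5.0)) does not. No decision boundary between classes is learned; the model only describes what the rare examples have in common.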
The other approach, which is more common and supported by several learning
algorithms, learns from examples belonging to all classes but first learns rules
to cover the rare classes [15, 49, 50]. Note that this approach avoids most of
the problems with data fragmentation, as examples belonging to the rare classes
will not be allocated to the rules associated with the common classes before any
rules are formed that cover the rare classes. Such methods are also free to focus
only on the performance of the rules associated with the rare class and not worry
about how this affects the overall performance of the classifier [15, 50]. Probably
the most popular such algorithm is the Ripper algorithm [49], which builds rules
using a separate-and-conquer approach. Ripper normally generates rules for each
class from the rarest class to the most common class. At each stage, it grows rules
for the one targeted class by adding conditions until the rule covers no examples
belonging to the other classes. This leads to highly specialized rules, which
are good for covering rare cases. Ripper then covers the most common class
using a default rule that is used when no other rule is applicable.
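The rarest-first, separate-and-conquer strategy described above can be sketched as follows. This is a heavily simplified illustration and not the Ripper algorithm itself: it omits Ripper's pruning and optimization phases and chooses conditions by plain precision. It assumes categorical attributes with string values, and all function names and the toy data are hypothetical.

```python
from collections import Counter


def covers(rule, features):
    """A rule maps attribute -> required value; every condition must hold."""
    return all(features.get(a) == v for a, v in rule.items())


def grow_rule(examples, target):
    """Add conditions greedily until no covered example is from another class."""
    rule = {}
    covered = list(examples)
    while any(label != target for _, label in covered):
        # Candidates come from covered target examples, so each candidate
        # keeps at least one target example covered.
        candidates = {
            (a, v)
            for feats, label in covered if label == target
            for a, v in feats.items() if a not in rule
        }
        if not candidates:
            break  # cannot specialise further (e.g., contradictory examples)

        def score(cond):
            a, v = cond
            kept = [lab for f, lab in covered if f.get(a) == v]
            pos = sum(lab == target for lab in kept)
            return (pos / len(kept), pos)  # precision first, then coverage

        a, v = max(sorted(candidates), key=score)
        rule[a] = v
        covered = [(f, lab) for f, lab in covered if f.get(a) == v]
    return rule


def learn_rule_list(examples):
    """Learn rules class by class, rarest first; the most common class is the default."""
    counts = Counter(label for _, label in examples)
    *targets, default = sorted(counts, key=lambda c: (counts[c], c))
    remaining = list(examples)
    rules = []
    for target in targets:
        while any(label == target for _, label in remaining):
            rule = grow_rule(remaining, target)
            rules.append((rule, target))
            # "Separate": remove everything the new rule covers, then continue.
            remaining = [(f, lab) for f, lab in remaining if not covers(rule, f)]
    return rules, default


def predict(rules, default, features):
    """Apply the first matching rule; fall back to the default class."""
    for rule, label in rules:
        if covers(rule, features):
            return label
    return default


# Hypothetical data: "fraud" is the rare class, "legit" the common one.
data = [
    ({"amount": "low", "foreign": "no"}, "legit"),
    ({"amount": "low", "foreign": "yes"}, "legit"),
    ({"amount": "high", "foreign": "no"}, "legit"),
    ({"amount": "low", "foreign": "no"}, "legit"),
    ({"amount": "low", "foreign": "yes"}, "legit"),
    ({"amount": "high", "foreign": "yes"}, "fraud"),
    ({"amount": "high", "foreign": "yes"}, "fraud"),
]
rules, default = learn_rule_list(data)
```

On this toy data the sketch learns a single two-condition rule for the rare class and leaves the common class to the default rule, so the rare examples are covered before any rule for the common class could absorb them.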
2.4.3.6 Algorithms for Mining Rare Items
Association rule mining is a well-understood area. However, when metrics other
than support and confidence are used to identify item sets or their association
rules, algorithmic changes are required. In Section 2.4.1, we briefly discussed
a variety of metrics that can be used alongside support and confidence to find
association rules. We did not describe the corresponding changes to the
association rule mining algorithms, but they are described in detail in the
relevant papers [21-26].
There is also an algorithmic solution to the rare item problem, in which
significant associations between rarely occurring items may be missed because the
minimum support value minsup cannot be set too low, as a very low value would
cause a combinatorial explosion of associations. This problem can be solved by
specifying multiple minimum levels of support to reflect the frequencies of the