classes or rare cases. Unfortunately, this cannot easily be done directly as one
cannot identify examples belonging to rare classes and rare cases with certainty.
But there is an expectation that active learning strategies will tend to
preferentially sample such examples. For example, uncertainty sampling methods [28]
are likely to focus more attention on rare cases, which will generally yield less
certain predictions because of the smaller number of training examples to
generalize from. Put another way, as small disjuncts have a much higher error rate
than large disjuncts, it seems clear that active learning methods would focus on
obtaining examples belonging to those disjuncts. Other work on active learning
has further demonstrated that active learning methods are capable of
preferentially sampling the rare classes by focusing the learning on the instances
around the classification boundary [29]. This general information acquisition
strategy is supported by the empirical evidence that shows that balanced class
distributions generally yield better performance than unbalanced ones [4].
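The selection step behind uncertainty sampling can be illustrated with a minimal sketch. It is not the method of [28]; it assumes a least-confidence criterion, and the function name, the toy probabilities, and the batch size are all hypothetical. Given class probabilities predicted for a pool of unlabeled examples, it picks the examples whose most likely class has the lowest probability, which are the ones an oracle would be asked to label next.

```python
def least_confident_indices(probabilities, batch_size):
    """Return indices of the `batch_size` pool examples with the least certain predictions.

    `probabilities` is a list of per-example class-probability lists
    (hypothetical output of any probabilistic classifier).
    """
    # Uncertainty of an example = 1 - probability of its most likely class.
    uncertainty = [1.0 - max(p) for p in probabilities]
    # Rank pool examples from most to least uncertain; ties keep pool order.
    ranked = sorted(range(len(probabilities)), key=lambda i: -uncertainty[i])
    return ranked[:batch_size]


# Toy pool (assumed values): [P(majority), P(minority)] for five unlabeled examples.
pool_probs = [
    [0.98, 0.02],
    [0.55, 0.45],
    [0.90, 0.10],
    [0.51, 0.49],
    [0.70, 0.30],
]

queried = least_confident_indices(pool_probs, batch_size=2)
print(queried)  # [3, 1]
```

The two selected examples are those closest to the classification boundary, which is where, according to the discussion above, examples from rare classes and rare cases are expected to be concentrated.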
Active learning and other simpler information acquisition strategies can also
assist with the relative rarity problem: because such strategies acquire examples
belonging to the rarer classes and rarer cases, they address the relative rarity
problem while addressing the absolute rarity problem. Note that this holds even if
uncertainty sampling methods tend to acquire examples belonging to rare cases
rather than directly to rare classes, as prior work has shown that rare cases tend
to be more associated with the rarer classes [4]. In fact, this method for dealing
with relative rarity is to be preferred to the sampling methods addressed next, as
those methods do not obtain new knowledge (i.e., valid new training examples).
2.4.2.2 Sampling Methods
Sampling methods are a very popular approach for dealing with imbalanced data.
These methods are primarily employed to address the problem of relative rarity
but do not address the issue of absolute rarity.
This is because, with the exception of some methods that utilize some intelligence
to generate new examples, these methods do not attack the underlying issue with
absolute rarity: a lack of examples belonging to the rare classes and rare cases.
But, as will be discussed in Section 2.4.3, our view is that sampling methods
do not address the underlying problem with relative rarity either. Rather, sampling
masks the problem by artificially balancing the data, without solving the basic
underlying issue. The proper solution is at the algorithm level and
requires algorithms that are designed to handle imbalanced data.
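To make concrete what "artificially balancing the data" means, the following minimal sketch implements the two simplest variants: random undersampling, which discards majority class examples, and random oversampling, which duplicates minority class examples. The function names and toy data are hypothetical. The sketch also assumes that the minority class is no larger than the majority class and that sampling continues until the classes are exactly balanced, although in practice any reduction in the degree of imbalance could be targeted.

```python
import random


def random_undersample(majority, minority, rng):
    """Randomly drop majority examples until both classes have the same size."""
    kept = rng.sample(majority, k=len(minority))  # sampling without replacement
    return kept, list(minority)


def random_oversample(majority, minority, rng):
    """Randomly duplicate minority examples until both classes have the same size."""
    n_extra = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]  # exact duplicates
    return list(majority), list(minority) + extra


rng = random.Random(0)  # fixed seed so the sketch is repeatable
majority = [("maj", i) for i in range(90)]  # toy data: 90 majority examples
minority = [("min", i) for i in range(10)]  # toy data: 10 minority examples

under_maj, under_min = random_undersample(majority, minority, rng)
over_maj, over_min = random_oversample(majority, minority, rng)

print(len(under_maj), len(under_min))  # 10 10
print(len(over_maj), len(over_min))    # 90 90
print(len(set(over_min)))              # 10
```

After oversampling, the minority class has 90 training examples but still only 10 distinct ones, which illustrates why balancing the data in this way does not add information about the rare class.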
The most basic sampling methods are random undersampling and random
oversampling. Random undersampling randomly eliminates majority class
examples from the training data, while random oversampling randomly
duplicates minority class training examples. Both of these sampling techniques
decrease the degree of class imbalance. But as no new information is introduced,
any underlying issues with absolute rarity are not addressed. Some studies
have shown random oversampling to be ineffective at improving recognition
of the minority class [30, 31], while another study has shown that random
undersampling is ineffective [32]. These two sampling methods also have
significant drawbacks. Undersampling discards potentially useful majority class