FOUNDATIONS OF IMBALANCED LEARNING - Imbalanced Learning: Foundations, Algorithms, and Applications


classes or rare cases. Unfortunately, this cannot easily be done directly, as one
cannot identify examples belonging to rare classes and rare cases with certainty.
But there is an expectation that active learning strategies will tend to
preferentially sample such examples. For example, uncertainty sampling methods [28]
are likely to focus more attention on rare cases, which will generally yield less
certain predictions because there are fewer training examples to generalize from.
Put another way, as small disjuncts have a much higher error rate than large
disjuncts, it seems clear that active learning methods would focus on obtaining
examples belonging to those disjuncts. Other work on active learning has further
demonstrated that active learning methods are capable of preferentially sampling
the rare classes by focusing the learning on the instances around the
classification boundary [29]. This general information acquisition strategy is
supported by empirical evidence showing that balanced class distributions
generally yield better performance than unbalanced ones [4].
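As a rough illustration of how uncertainty sampling chooses what to label next, the sketch below ranks unlabeled examples by the classifier's confidence in its predicted class and queries the least confident ones. The least-confidence criterion is only one common variant of uncertainty sampling and is assumed here; the function name and the probability values are hypothetical and do not come from [28].

```python
import numpy as np

def least_confident_indices(probabilities, k):
    """Return indices of the k pool examples whose highest predicted
    class probability is lowest, i.e. the least certain predictions."""
    probabilities = np.asarray(probabilities, dtype=float)
    confidence = probabilities.max(axis=1)  # confidence in the predicted class
    # Stable sort so that ties are broken by position in the pool.
    return np.argsort(confidence, kind="stable")[:k]

# Hypothetical class-probability estimates for five unlabeled examples.
pool_probs = np.array([
    [0.98, 0.02],
    [0.55, 0.45],
    [0.90, 0.10],
    [0.51, 0.49],
    [0.70, 0.30],
])

query = least_confident_indices(pool_probs, k=2)
print(query)  # [3 1]
```

The two selected examples are those whose predictions lie closest to the classification boundary, which is where examples from rare cases and small disjuncts are expected to concentrate.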

Active learning and other, simpler information acquisition strategies can also
assist with the relative rarity problem: by acquiring examples belonging to the
rarer classes and rarer cases, such strategies address relative rarity at the
same time as absolute rarity. Note that this holds even if uncertainty sampling
methods tend to acquire examples belonging to rare cases rather than rare
classes, as prior work has shown that rare cases tend to be more associated with
the rarer classes [4]. In fact, this method for dealing with relative rarity is to
be preferred to the sampling methods addressed next, as those methods do not
obtain new knowledge (i.e., valid new training examples).

2.4.2.2 Sampling Methods

Sampling methods are a very popular approach for dealing with imbalanced data.
These methods are primarily employed to address the problem of relative rarity
but do not address the issue of absolute rarity. This is because, with the
exception of some methods that utilize some intelligence to generate new
examples, these methods do not attack the underlying issue with absolute rarity:
a lack of examples belonging to the rare classes and rare cases. But, as will be
discussed in Section 2.4.3, our view is that sampling methods do not address the
underlying problem with relative rarity either. Rather, sampling masks the
underlying problem by artificially balancing the data, without solving the basic
underlying issue. The proper solution is at the algorithm level and requires
algorithms that are designed to handle imbalanced data.
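To make concrete why resampling changes only the class proportions and not the information available, the following minimal Python sketch implements the two most basic sampling methods: random undersampling (randomly dropping majority class examples) and random oversampling (randomly duplicating minority class examples). The function names, the NumPy-based interface, and the toy data are assumptions made for illustration and are not taken from the text.

```python
import numpy as np

def random_undersample(X, y, majority_label, n_keep, rng):
    """Keep a random subset of n_keep majority examples and all other examples."""
    majority_idx = np.flatnonzero(y == majority_label)
    other_idx = np.flatnonzero(y != majority_label)
    kept = rng.choice(majority_idx, size=n_keep, replace=False)
    idx = np.sort(np.concatenate([kept, other_idx]))
    return X[idx], y[idx]

def random_oversample(X, y, minority_label, n_extra, rng):
    """Append n_extra randomly chosen duplicates of minority examples."""
    minority_idx = np.flatnonzero(y == minority_label)
    duplicates = rng.choice(minority_idx, size=n_extra, replace=True)
    idx = np.concatenate([np.arange(len(y)), duplicates])
    return X[idx], y[idx]

# Toy data (hypothetical): 8 majority examples (label 0), 2 minority (label 1).
rng = np.random.default_rng(0)
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

X_under, y_under = random_undersample(X, y, majority_label=0, n_keep=2, rng=rng)
X_over, y_over = random_oversample(X, y, minority_label=1, n_extra=6, rng=rng)

print(np.bincount(y_under))  # class counts after undersampling
print(np.bincount(y_over))   # class counts after oversampling
```

On this toy data both calls produce balanced class counts, yet every row in either result is a copy of a row already present in the original data: undersampling has removed majority examples and oversampling has only repeated minority ones.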

The most basic sampling methods are random undersampling and random
oversampling. Random undersampling randomly eliminates majority class examples
from the training data, while random oversampling randomly duplicates minority
class training examples. Both of these sampling techniques decrease the degree of
class imbalance. But as no new information is introduced, any underlying issues
with absolute rarity are not addressed. Some studies have shown random
oversampling to be ineffective at improving recognition of the minority class
[30, 31], while another study has shown that random undersampling is
ineffective [32]. These two sampling methods also have significant drawbacks.
Undersampling discards potentially useful majority class
