approach to handling absolute rarity, in either of its two main forms, is to acquire
additional training examples. This can often be done most efficiently via active
learning and other information acquisition strategies.
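The idea behind acquiring labels selectively can be sketched as a simple uncertainty-sampling loop. The function names and the toy probability model below are illustrative assumptions, not from the text: examples whose predicted class probability is closest to 0.5 are queried first, since labeling these tends to surface informative (often rare-class) examples faster than random selection.

```python
def uncertainty_sampling(unlabeled, predict_proba, batch_size=5):
    """Return the unlabeled examples whose predicted minority-class
    probability is closest to 0.5, i.e., where the model is least certain.
    (Hypothetical helper for illustration only.)"""
    scored = sorted(unlabeled, key=lambda x: abs(predict_proba(x) - 0.5))
    return scored[:batch_size]

# Toy model (an assumption for this sketch): the probability of the
# rare class rises linearly with the single feature value.
predict_proba = lambda x: min(max(x / 10.0, 0.0), 1.0)

pool = [0.5, 2.0, 4.9, 5.1, 9.5, 7.0]
queried = uncertainty_sampling(pool, predict_proba, batch_size=2)
print(queried)  # the two examples nearest the decision boundary: [4.9, 5.1]
```

In a real loop, the queried examples would be labeled by an oracle, added to the training set, and the model retrained before the next query round.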
It is important to understand that we do not view class imbalance, which
results from a relative difference in frequency between the classes, as a
problem at the data level; the problem exists only because most algorithms do
not respond well to such imbalances. The straightforward method for dealing with
class imbalance is via sampling, a method that operates at the data level. But this
method for dealing with class imbalance has many problems, as we discussed
previously (e.g., undersampling involves discarding potentially useful data) and
is far from ideal. A much better solution would be to develop algorithms that
can handle the class imbalance. At present, sampling methods perform
competitively and therefore cannot be ignored, but it is important to
recognize that such methods will always have limited value and that algorithmic
solutions can potentially be more effective. We discuss these methods next (e.g.,
one-class learning) because we view them as addressing foundational algorithmic
issues.
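As a concrete illustration of sampling at the data level, the sketch below shows random undersampling and oversampling of a two-class dataset. The function name and toy data are hypothetical; the comments note the drawbacks the text raises (undersampling discards potentially useful data, oversampling merely duplicates examples):

```python
import random

def rebalance(majority, minority, strategy="undersample", seed=0):
    """Rebalance a two-class dataset at the data level (illustrative sketch).

    undersample: discard majority examples until the classes match;
                 this throws away potentially useful data, as noted above.
    oversample:  duplicate minority examples until the classes match;
                 no new information is added, only copies.
    """
    rng = random.Random(seed)
    if strategy == "undersample":
        majority = rng.sample(majority, len(minority))
    else:
        extra = rng.choices(minority, k=len(majority) - len(minority))
        minority = minority + extra
    return majority, minority

maj = list(range(100))        # 100 majority-class examples
mino = list(range(100, 110))  # 10 minority-class examples

u_maj, u_min = rebalance(maj, mino, "undersample")
o_maj, o_min = rebalance(maj, mino, "oversample")
print(len(u_maj), len(u_min))  # 10 10
print(len(o_maj), len(o_min))  # 100 100
```

Either way the resulting class ratio is 1:1, but only the data distribution has changed; the learning algorithm itself is untouched, which is why the text argues such fixes have limited value.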
Algorithm-level issues mainly involve the ability to find subtle patterns in
data that may be obscured because of imbalanced data and class imbalance,
in particular (i.e., relative rarity). Finding patterns, such as those that identify
examples belonging to a very rare class, is a very difficult task. To accomplish
this task, it is important to have an appropriate search algorithm, a good evaluation
metric to guide the heuristic search process, and an appropriate inductive bias.
It is also important to deal with issues such as data fragmentation, which can
be problematic especially for imbalanced data. The most common mechanism
for dealing with this algorithm-level problem is to use sampling, a data-level
method, to reduce the degree of class imbalance. But for reasons outlined earlier,
this strategy does not address the underlying issue, although it
does provide some benefit. The strategies that function at the algorithm level
include using a non-greedy search algorithm and one that does not repeatedly
partition the search space; using search heuristics that are guided by metrics that
are appropriate for imbalanced data; using inductive biases that are appropriate
for imbalanced data; and using algorithms that explicitly or implicitly focus on
the rare classes or rare cases, or only learn the rare class.
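One of the algorithm-level concerns above is the choice of evaluation metric used to guide search. The sketch below (illustrative function names and toy data, not from the text) shows why: overall accuracy can look excellent on imbalanced data even when the rare class is never identified, whereas a minority-class F1 score exposes the failure.

```python
def accuracy(y_true, y_pred):
    """Fraction of all examples classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_minority(y_true, y_pred, minority=1):
    """Harmonic mean of precision and recall for the minority class."""
    tp = sum(t == minority and p == minority for t, p in zip(y_true, y_pred))
    fp = sum(t != minority and p == minority for t, p in zip(y_true, y_pred))
    fn = sum(t == minority and p != minority for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy data: 95 majority (0) and 5 minority (1) examples; the classifier
# simply predicts the majority class everywhere.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

print(accuracy(y_true, y_pred))     # 0.95 -- looks excellent
print(f1_minority(y_true, y_pred))  # 0.0  -- the rare class is never found
```

A heuristic search guided by accuracy would happily keep this degenerate classifier; one guided by minority-class F1 (or a similar imbalance-aware metric) would reject it.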
2.6 MISCONCEPTIONS ABOUT SAMPLING METHODS
Sampling methods are the most common methods for dealing with imbalanced
data, yet widespread misconceptions about these methods persist.
The most basic misconception concerns the notion that sampling methods are
equivalent to certain other methods for dealing with class imbalance. In partic-
ular, Breiman et al. [51] establish the connection between the distribution of
training-set examples, the costs of mistakes on each class, and the placement of
the decision threshold. Thus, for example, one can make false negatives twice