similar) distribution to those originally in the dataset. Note that sampling meth-
ods need not create an exactly balanced distribution, merely a distribution that
traditional classifiers are better able to handle.
Two of the first sampling methods developed were random under-sampling
and random over-sampling. In random under-sampling, majority class
instances are discarded at random until a more balanced distribution is reached.
Consider, for example, a dataset consisting of 10 minority class instances and
100 majority class instances. In random under-sampling, one might attempt to
create a balanced class distribution by selecting 90 majority class instances at
random to be removed. The resulting dataset will then consist of 20 instances:
10 (randomly remaining) majority class instances and (the original) 10 minority
class instances.
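To make the procedure concrete, the following Python sketch reproduces the example above. The list-based data representation and the helper name random_undersample are illustrative choices, not taken from any particular library.

```python
import random

def random_undersample(majority, minority, seed=0):
    """Randomly discard majority instances until the classes are balanced."""
    rng = random.Random(seed)
    # Keep only as many majority instances as there are minority instances.
    kept_majority = rng.sample(majority, k=len(minority))
    return kept_majority + minority

# 100 majority and 10 minority instances, as in the example above.
majority = [("maj", i) for i in range(100)]
minority = [("min", i) for i in range(10)]

balanced = random_undersample(majority, minority)
print(len(balanced))  # 20 instances: 10 majority + 10 minority
```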
Alternatively, in random over-sampling, minority class instances are copied
and repeated in the dataset until a more balanced distribution is reached. Thus, if
there are two minority class instances and 100 majority class instances, random
over-sampling would copy the two minority class instances 49 times each. The
resulting dataset would then consist of 200 instances: the 100 majority class
instances and 100 minority class instances (i.e., 50 copies each of the two minority
class instances).
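A corresponding sketch of random over-sampling is given below; again the representation and the helper name random_oversample are illustrative. The sketch draws replacement copies at random, which for the example above (2 minority and 100 majority instances) yields roughly 50 copies of each minority instance.

```python
import random

def random_oversample(majority, minority, seed=0):
    """Replicate minority instances, drawn at random with replacement,
    until the class distribution is balanced."""
    rng = random.Random(seed)
    n_extra = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return majority + minority + extra

# 100 majority and 2 minority instances, as in the example above.
majority = [("maj", i) for i in range(100)]
minority = [("min", i) for i in range(2)]

balanced = random_oversample(majority, minority)
print(len(balanced))  # 200 instances: 100 majority, 100 minority
```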
While random under-sampling and random over-sampling create more bal-
anced distributions, they both suffer from serious drawbacks. In random
under-sampling, potentially vast quantities of data are discarded. In the
under-sampling example mentioned above, roughly 82% of the data (the 90
majority class instances) was discarded. This can be highly
problematic, as the loss of such data can make the decision boundary between
minority and majority instances harder to learn, resulting in a loss in classification
performance.
Alternatively, in random over-sampling, instances are repeated (sometimes
to very high degrees). Consider the random over-sampling example mentioned
above, where each instance had to be replicated 49 times in order to balance out
the class distribution. By copying instances in this way, one can cause drastic
overfitting to occur in the classifier, making the generalization performance of
the classifier exceptionally poor. The potential for overfitting grows as the
class imbalance ratio worsens and each instance must be replicated more and
more often.
In order to overcome these limitations, more sophisticated sampling techniques
have been developed. We now describe some of these techniques.
3.2.1 Under-Sampling Techniques
The major drawback of random under-sampling is that potentially useful infor-
mation can be discarded when samples are chosen randomly. In order to combat
this, various techniques have been developed that aim to retain all useful
information present in the majority class by removing redundant, noisy, and/or
borderline instances from the dataset. Redundant instances are considered safe
to remove as they, by definition, do not add any information about the majority
class.
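As one concrete illustration of removing redundant majority instances, the sketch below applies a single pass of Hart's condensed nearest neighbor rule, a common building block of informed under-sampling methods. The two-dimensional points and the Euclidean distance are assumptions made for the example, and practical implementations typically iterate until the retained set stabilizes.

```python
import math

def nearest_label(store, x):
    """Return the label of the stored point closest to x (1-NN)."""
    best = min(store, key=lambda p: math.dist(p[0], x))
    return best[1]

def condense(majority, minority):
    """Keep all minority points; keep a majority point only if the
    current store would misclassify it (i.e., it is not redundant)."""
    store = [(x, "min") for x in minority] + [(majority[0], "maj")]
    for x in majority[1:]:
        if nearest_label(store, x) != "maj":
            store.append((x, "maj"))
    return store

# Tightly clustered majority points are largely redundant to one another.
minority = [(0.0, 0.0), (0.2, 0.1)]
majority = [(1.0 + 0.01 * i, 1.0) for i in range(100)]
print(len(condense(majority, minority)))  # far fewer than 102 points
```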