as costly as false positives by assigning appropriate costs, by increasing the
ratio of positive to negative examples in the training set by a factor of 2, or
by raising the probability threshold for predicting the majority class from
one-half to two-thirds (equivalently, lowering the threshold for the rare class
to one-third). Unfortunately, as implemented in real-world situations, these
equivalences do not hold.
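Before examining why these equivalences break down in practice, it helps to see
them side by side. The sketch below shows the three nominally equivalent ways
of making false negatives twice as costly as false positives; the use of
scikit-learn's LogisticRegression and a synthetic dataset are illustrative
assumptions, not choices made in the text.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01],
                               random_state=0)

    # (1) Cost-sensitive learning: penalize errors on the rare (positive)
    # class twice as heavily as errors on the majority class.
    clf_cost = LogisticRegression(class_weight={0: 1, 1: 2}).fit(X, y)

    # (2) Resampling: double the ratio of positive to negative examples
    # by duplicating every positive example once.
    pos = np.where(y == 1)[0]
    idx = np.concatenate([np.arange(len(y)), pos])
    clf_samp = LogisticRegression().fit(X[idx], y[idx])

    # (3) Threshold moving: train on the original data, but predict the
    # majority class only when p(negative) >= 2/3, i.e., predict the rare
    # class whenever p(positive) >= 1/3.
    clf = LogisticRegression().fit(X, y)
    y_pred = (clf.predict_proba(X)[:, 1] >= 1 / 3).astype(int)

In theory, all three classifiers encode the same 2 : 1 cost ratio; the concrete
example that follows shows why, in practice, the resampling variant behaves
differently.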
As a concrete example, suppose that a training set has 10,000 examples and
a class distribution of 100 : 1, so that there are only 100 positive examples. One
way to improve the identification of the rare class is to impose a greater cost
for false negatives than for false positives. A cost ratio of 100 : 1 is theoretically
equivalent to modifying the training distribution, so that it is balanced, with a
1 : 1 class ratio. To generate such a balanced distribution in practice, one would
typically oversample the minority class or undersample the majority class, or do
both. But if one undersamples the majority class, then potentially valuable data
is thrown away, and if one oversamples the minority class, then one is making
exact copies of examples, which can lead to overfitting. For the equivalence to
hold, one should randomly select new minority class examples from the original
distribution, which would include examples that are not already available for
training. But this is almost never feasible. Even generating new, synthetic
minority class examples violates the equivalence, as these examples will, at
best, only be a better approximation of the true distribution. Thus, in
practice, sampling methods are not equivalent to other methods for dealing
with imbalanced data, and they have drawbacks that properly implemented
alternatives, such as cost-sensitive learning, do not.
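The helper functions below sketch the sampling strategies just described,
assuming a feature matrix X and binary labels y with 1 marking the minority
class; the simplified, SMOTE-like interpolation (random minority pairs rather
than nearest neighbors) is an illustrative stand-in for generating synthetic
examples.

    import numpy as np

    rng = np.random.default_rng(0)

    def oversample_exact(X, y, n_new):
        # Duplicate randomly chosen minority examples; the exact copies
        # can lead to overfitting.
        pos = np.where(y == 1)[0]
        picks = rng.choice(pos, size=n_new, replace=True)
        return (np.vstack([X, X[picks]]),
                np.concatenate([y, np.ones(n_new, dtype=int)]))

    def undersample_majority(X, y, n_keep):
        # Keep all minority examples but only n_keep majority examples;
        # potentially valuable data is thrown away.
        keep = np.concatenate([np.where(y == 1)[0],
                               rng.choice(np.where(y == 0)[0],
                                          size=n_keep, replace=False)])
        return X[keep], y[keep]

    def synthesize_minority(X, y, n_new):
        # Interpolate between random pairs of minority examples; the new
        # points are, at best, an approximation of the true distribution.
        pos = X[y == 1]
        a = pos[rng.integers(len(pos), size=n_new)]
        b = pos[rng.integers(len(pos), size=n_new)]
        lam = rng.random((n_new, 1))
        return (np.vstack([X, a + lam * (b - a)]),
                np.concatenate([y, np.ones(n_new, dtype=int)]))

None of these manipulations draws genuinely new minority examples from the
original distribution, which is why the equivalence discussed above fails in
practice.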
Another significant concern with sampling is that its impact is often not fully
understood, or even considered. Increasing the proportion of examples belonging
to the rare class has two distinct effects. First, it will help address the
problems with relative rarity and, if the examples are new examples, will also
address the problem with absolute rarity by injecting new knowledge. However,
if no corrective action is taken, it will also have a second effect: it will
impose nonuniform error costs, causing the learner to be biased in favor of
predicting the rare class. In many situations, this second effect is desired
and is actually the main reason for altering the class distribution of the
training data. But in other cases,
namely when new examples are added (e.g., via active learning), this effect is
not desirable. That is, in these other cases, the intent is to improve performance
with respect to the rare class by having more data available for that class, not
by biasing the data-mining algorithm toward that class. In these cases, this bias
should be removed.
The bias introduced toward predicting the oversampled class can be removed
using the equivalences noted earlier to account for the differences between the
training distribution and the underlying distribution [4, 43]. For example, the
bias can be removed by adjusting the decision thresholds, as was done in one
study that demonstrated the positive impact of removing this unintended bias [4].
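One way to carry out such a threshold adjustment is to correct each predicted
posterior for the difference between the training prior and the natural prior
by rescaling the odds, as sketched below; the function name and the prior
arguments pi_train and pi_true are assumptions for this sketch, not notation
taken from [4] or [43].

    def correct_posterior(p_train, pi_train, pi_true):
        # Convert p(positive | x), estimated under the (resampled) training
        # prior pi_train, into an estimate under the natural prior pi_true
        # by scaling the odds by the ratio of the two priors.
        k = (pi_true * (1 - pi_train)) / (pi_train * (1 - pi_true))
        odds = k * p_train / (1 - p_train)
        return odds / (1 + odds)

    # Example: a classifier trained on a balanced (1 : 1) sample predicts
    # 0.70 for the rare class, but the natural distribution is 100 : 1.
    p = correct_posterior(0.70, pi_train=0.5, pi_true=1 / 101)
    # p is now roughly 0.023, so the usual one-half threshold no longer
    # over-predicts the rare class.

Equivalently, one can leave the predicted probabilities unchanged and shift the
decision threshold itself by the same odds factor.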
That study showed that adding new examples to alter the class distribution of
the training data, so that it deviates from the natural, underlying
distribution, improved classifier performance. However, classifier performance
was improved only after the bias introduced by the altered distribution was
removed.