examples that belong to these spaces) into smaller and smaller pieces. This
process leads to data fragmentation [16], which is a significant problem when
trying to identify rare patterns in the data, because there is less data in each
partition from which to identify the rare patterns. Repeated partitioning can
lead to the problem of absolute rarity within an individual partition, even if
the original dataset exhibits only the problem of relative rarity. Data-mining
algorithms that do not employ a divide-and-conquer approach therefore tend to be
more appropriate when mining rare classes/cases.
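As a rough illustration of how fragmentation can turn relative rarity into
absolute rarity, the following Python sketch counts the average number of
examples left in each partition after repeated binary splits. The function name,
the dataset sizes, and the assumption that every split divides the data evenly
are hypothetical and chosen only for illustration; real divide-and-conquer
learners split the data unevenly.

```python
def examples_per_partition(n_total, n_minority, n_splits):
    """Average (total, minority) examples per partition after n_splits
    rounds of binary splitting, assuming perfectly even splits."""
    n_partitions = 2 ** n_splits
    return n_total / n_partitions, n_minority / n_partitions


# Hypothetical dataset: 100,000 examples, 1,000 of them (1%) in the rare class.
N_TOTAL, N_MINORITY = 100_000, 1_000

for n_splits in (0, 4, 8):
    total, minority = examples_per_partition(N_TOTAL, N_MINORITY, n_splits)
    print(f"{n_splits} splits: {total:.1f} examples per partition, "
          f"{minority:.1f} from the rare class")
```

Under these assumptions the rare class still makes up 1% of each partition, but
after eight splits an average partition holds fewer than four rare examples,
leaving very little data from which to identify a rare pattern.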
2.4 METHODS FOR ADDRESSING IMBALANCED DATA
This section describes the methods that address the issues with learning from
imbalanced data that were identified in the previous section. These methods
are organized based on whether they operate at the problem definition, data,
or algorithm level. As methods are introduced, the underlying issues that they
address are highlighted. While this section covers most of the major methods
that have been developed to handle imbalanced data, the list of methods is not
exhaustive.
2.4.1 Problem-Definition-Level Methods
There are a number of methods for dealing with imbalanced data that operate at
the problem definition level. Some of these methods are relatively straightforward,
in that they directly address foundational issues that operate at the same level.
But because of the inherent difficulty of learning from imbalanced data, some
methods have been introduced that simplify the problem in order to produce
more reasonable results. Finally, in many cases there is simply insufficient
information to define the problem properly; in these cases, the best option is
to use a method that moderates the impact of this lack of knowledge.
2.4.1.1 Use Appropriate Evaluation Metrics
It is always preferable to use evaluation metrics that properly factor in how
the mined knowledge will be used. Such metrics are essential when learning from
imbalanced data because they will properly value the minority class. These
metrics can be contrasted with accuracy, which places more weight on the common
classes and assigns value to each class in proportion to its frequency in the
training set. The proper solution is to use meaningful and appropriate
evaluation metrics, and for imbalanced data this typically translates into
providing accurate cost information to the learning algorithms (which should
then utilize cost-sensitive learning to produce an appropriate classifier).
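The contrast between accuracy and a cost-based metric can be sketched in a few
lines of Python. The confusion-matrix counts, the misclassification costs, and
the function names below are all hypothetical; the sketch assumes only that
missing a minority-class example costs more than raising a false alarm.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all examples classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def total_cost(fp, fn, cost_fp, cost_fn):
    """Total misclassification cost, with separate costs per error type."""
    return fp * cost_fp + fn * cost_fn


# Hypothetical test set: 990 majority (negative) and 10 minority (positive)
# examples. Classifier A labels everything negative; classifier B finds 8 of
# the 10 minority examples at the price of 50 false alarms.
clf_a = {"tp": 0, "fp": 0, "tn": 990, "fn": 10}
clf_b = {"tp": 8, "fp": 50, "tn": 940, "fn": 2}

# Assumed costs: a missed minority example costs 50 times a false alarm.
COST_FP, COST_FN = 1, 50

for name, c in (("A", clf_a), ("B", clf_b)):
    acc = accuracy(**c)
    cost = total_cost(c["fp"], c["fn"], COST_FP, COST_FN)
    print(f"classifier {name}: accuracy={acc:.3f}, cost={cost}")
```

With these numbers, accuracy favors classifier A (0.990 versus 0.948) even
though it never identifies a minority example, while the cost-based metric
favors classifier B (150 versus 500).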
Unfortunately, it is not always possible to acquire the base information
necessary to design good evaluation metrics that properly value the minority class.
The next best solution is to provide evaluation metrics that are robust, given