FOUNDATIONS OF IMBALANCED LEARNING - Imbalanced Learning: Foundations, Algorithms, and Applications


examples that belong to these spaces) into smaller and smaller pieces. This process leads to data fragmentation [16], which is a significant problem when trying to identify rare patterns in the data, because there is less data in each partition from which to identify the rare patterns. Repeated partitioning can lead to the problem of absolute rarity within an individual partition, even if the original dataset exhibits only the problem of relative rarity. Data-mining algorithms that do not employ a divide-and-conquer approach therefore tend to be more appropriate when mining rare classes/cases.
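The arithmetic behind data fragmentation can be sketched as follows. This is only an illustration under simplifying assumptions that are not in the text: every split is a balanced binary split, and minority examples are spread evenly across partitions. The function name and the dataset sizes are hypothetical.

```python
def partition_stats(n_total, n_minority, depth):
    """Average examples per partition after `depth` balanced binary splits.

    Assumption for illustration: each split halves the data and the
    minority examples are distributed evenly across partitions.
    Returns (examples per partition, minority examples per partition).
    """
    n_partitions = 2 ** depth
    return n_total / n_partitions, n_minority / n_partitions


if __name__ == "__main__":
    # Hypothetical dataset: 10,000 examples, 100 of them minority (1%).
    for depth in (0, 2, 4, 6):
        per_part, minority_per_part = partition_stats(10_000, 100, depth)
        ratio = minority_per_part / per_part
        print(f"depth={depth}: {per_part:.2f} examples per partition, "
              f"{minority_per_part:.4f} minority, ratio={ratio:.2%}")
```

Under these assumptions the minority ratio within each partition stays at 1% (relative rarity is unchanged), while the number of minority examples available in each partition shrinks toward one or two (absolute rarity).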

2.4 METHODS FOR ADDRESSING IMBALANCED DATA

This section describes the methods that address the issues with learning from imbalanced data that were identified in the previous section. These methods are organized based on whether they operate at the problem definition, data, or algorithm level. As methods are introduced, the underlying issues that they address are highlighted. While this section covers most of the major methods that have been developed to handle imbalanced data, the list of methods is not exhaustive.

2.4.1 Problem-Definition-Level Methods

There are a number of methods for dealing with imbalanced data that operate at the problem-definition level. Some of these methods are relatively straightforward, in that they directly address foundational issues that operate at the same level. But because of the inherent difficulty of learning from imbalanced data, some methods have been introduced that simplify the problem in order to produce more reasonable results. Finally, it is important to note that in many cases there is simply insufficient information to properly define the problem, and in these cases the best option is to utilize a method that moderates the impact of this lack of knowledge.

2.4.1.1 Use Appropriate Evaluation Metrics

It is always preferable to use evaluation metrics that properly factor in how the mined knowledge will be used. Such metrics are essential when learning from imbalanced data because they will properly value the minority class. These metrics can be contrasted with accuracy, which places more weight on the common classes and assigns value to each class in proportion to its frequency in the training set. The proper solution is to use meaningful and appropriate evaluation metrics; for imbalanced data, this typically translates into providing accurate cost information to the learning algorithms (which should then utilize cost-sensitive learning to produce an appropriate classifier).
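The contrast between accuracy and a cost-based metric can be illustrated with a small sketch. The labels, the two classifiers, and the misclassification costs below are all hypothetical and chosen only for illustration. In particular, the 20:1 cost ratio is an assumption, not a value from the text.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def total_cost(y_true, y_pred, cost_fn, cost_fp):
    """Total misclassification cost; label 1 is the minority class.

    cost_fn: cost of missing a minority example (false negative).
    cost_fp: cost of flagging a majority example (false positive).
    """
    cost = 0.0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 0:
            cost += cost_fn
        elif t == 0 and p == 1:
            cost += cost_fp
    return cost


# Hypothetical test set: 5 minority examples followed by 95 majority examples.
y_true = [1] * 5 + [0] * 95

# Classifier A always predicts the majority class.
always_majority = [0] * 100

# Classifier B finds 4 of the 5 minority examples but raises 10 false alarms.
detector = [1, 1, 1, 1, 0] + [1] * 10 + [0] * 85

if __name__ == "__main__":
    for name, y_pred in (("always_majority", always_majority),
                         ("detector", detector)):
        print(name,
              "accuracy:", accuracy(y_true, y_pred),
              "cost:", total_cost(y_true, y_pred, cost_fn=20, cost_fp=1))
```

With these assumed costs, accuracy ranks the trivial majority-class predictor higher, while the cost-based metric ranks the detector higher, because each missed minority example is charged far more than a false alarm.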

Unfortunately, it is not always possible to acquire the base information necessary to design good evaluation metrics that properly value the minority class. The next best solution is to provide evaluation metrics that are robust, given
