FOUNDATIONS OF IMBALANCED LEARNING - Imbalanced Learning: Foundations, Algorithms, and Applications


algorithm's responsibility to utilize this information appropriately; if the algorithm cannot do this, then there is an algorithm-level issue. Fortunately, over the past decade, most classification algorithms have increased in sophistication so that they can handle evaluation criteria beyond accuracy, such as class-based misclassification costs and even costs that vary per example.
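One common way to act on class-based or per-example misclassification costs is to predict the label with the lower expected cost. The following is a minimal sketch of that idea, not a method taken from the text: the function names are hypothetical, the classifier is assumed to output a probability for the positive class, and correct predictions are assumed to cost nothing.

```python
from typing import List, Sequence


def min_expected_cost_label(p_positive: float, cost_fp: float, cost_fn: float) -> int:
    """Return 1 (positive) or 0 (negative), whichever has the lower expected cost.

    Assumption: correct predictions have zero cost.
    """
    # Predicting negative is wrong with probability p_positive (a false negative).
    expected_cost_if_negative = p_positive * cost_fn
    # Predicting positive is wrong with probability 1 - p_positive (a false positive).
    expected_cost_if_positive = (1.0 - p_positive) * cost_fp
    return 1 if expected_cost_if_positive < expected_cost_if_negative else 0


def predict_with_example_costs(
    probs: Sequence[float],
    costs_fp: Sequence[float],
    costs_fn: Sequence[float],
) -> List[int]:
    """Apply the same rule when the costs vary per example."""
    return [
        min_expected_cost_label(p, cfp, cfn)
        for p, cfp, cfn in zip(probs, costs_fp, costs_fn)
    ]
```

With equal costs, an example whose positive-class probability is 0.1 is labeled negative; if a false negative is assumed to cost 20 times as much as a false positive, the same example is labeled positive, which is how cost information can offset a rare class.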

The problem definition issue also extends to unsupervised learning problems. Association rule mining systems do not have very good ways to evaluate the value of an association rule. Unlike in classification, however, no single quantitative measure of quality is generated, so this issue is probably better understood and acknowledged. Association rules are usually tagged with support and confidence values, but many rules with high support, high confidence, or even both will be uninteresting and potentially of little value. The lift of an association rule, which measures how much more likely the antecedent and consequent of the rule are to occur together than if they were statistically independent, is a somewhat more useful measurement, but it still does not consider the context in which the association will be used. As with classification tasks, imbalanced data causes further problems for the metrics most commonly used for association rule mining. As mentioned earlier, association rules that involve rare items are not likely to be generated, even if the rare items, when they do occur, often occur together (e.g., cooking pan and spatula in supermarket sales). This is a problem because associations between rare items are more likely to be profitable, as higher profit margins are generally associated with rare items.
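To make the three measures concrete, here is a small sketch that computes support, confidence, and lift over a toy transaction list. The data, the function names, and the 0.05 minimum-support threshold mentioned afterward are illustrative assumptions, not values from the text.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in itemset."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(transactions, antecedent, consequent) -> float:
    """Estimate of P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent) -> float:
    """Ratio of observed co-occurrence to that expected under independence."""
    return confidence(transactions, antecedent, consequent) / support(
        transactions, consequent
    )


# Hypothetical supermarket data: 100 baskets, only 2 with the rare items.
transactions = (
    [frozenset({"bread", "milk"})] * 60
    + [frozenset({"bread"})] * 20
    + [frozenset({"milk"})] * 18
    + [frozenset({"pan", "spatula"})] * 2
)

rare_support = support(transactions, {"pan", "spatula"})
rare_lift = lift(transactions, {"pan"}, {"spatula"})
common_support = support(transactions, {"bread", "milk"})
common_lift = lift(transactions, {"bread"}, {"milk"})
```

In this toy data the rule pan → spatula has a support of only 0.02, so a miner with a minimum support of, say, 0.05 would never generate it, even though its confidence is 1.0 and its lift is about 50. The rule bread → milk easily passes the support threshold with a support of 0.6, yet its lift is slightly below 1.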

2.3.2 Data-Level Issues

The most fundamental data-level issue is the lack of training data that often accompanies imbalanced data, which was previously referred to as an issue of absolute rarity. Absolute rarity does not occur only when there is imbalanced data, but it is very often present when there are extreme degrees of imbalance, such as a class ratio of one to one million. In these cases, the number of examples associated with the rare class, or rare case, is small in an absolute sense. There is no predetermined threshold for determining absolute rarity; any such threshold would have to be domain specific and would depend on factors such as the dimensionality of the instance space, the distribution of the feature values within this instance space, and, for classification tasks, the complexity of the concept to be learned.

Figure 2.5 visually demonstrates the problems that can result from an “absolute” lack of data. The figure shows a simple concept, identified by the solid rectangle; examples within this rectangle belong to the positive class and examples outside this rectangle belong to the negative class. The decision boundary induced by the classifier from the labeled training data is indicated by the dashed rectangle. Figures 2.5a and 2.5b show the same concept, but Figure 2.5b has approximately half as many training examples as Figure 2.5a. As one would expect, the induced classifier more closely approximates the true decision boundary in Figure 2.5a because of the availability of additional training data.
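The effect described for Figure 2.5 can be imitated with a small simulation. This is only a sketch under stated assumptions: the rectangle coordinates, the uniform sampling, and the learner (the tightest axis-aligned rectangle around the positive examples) are all hypothetical stand-ins, since the text does not say which classifier produced the dashed boundary.

```python
import random
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x_min, x_max, y_min, y_max)
Example = Tuple[float, float, bool]       # (x, y, is_positive)

# Assumed true concept: a small rectangle inside the unit square.
TRUE_RECT: Rect = (0.4, 0.6, 0.4, 0.6)


def in_rect(x: float, y: float, rect: Rect) -> bool:
    return rect[0] <= x <= rect[1] and rect[2] <= y <= rect[3]


def sample_examples(n: int, seed: int = 0) -> List[Example]:
    """Draw n points uniformly from the unit square and label them by TRUE_RECT."""
    rng = random.Random(seed)
    points = [(rng.random(), rng.random()) for _ in range(n)]
    return [(x, y, in_rect(x, y, TRUE_RECT)) for x, y in points]


def induce_rectangle(examples: List[Example]) -> Optional[Rect]:
    """Stand-in learner: tightest rectangle that encloses the positive examples."""
    positives = [(x, y) for x, y, label in examples if label]
    if not positives:
        return None  # no positive examples, so nothing can be induced
    xs = [x for x, _ in positives]
    ys = [y for _, y in positives]
    return (min(xs), max(xs), min(ys), max(ys))


def area(rect: Optional[Rect]) -> float:
    if rect is None:
        return 0.0
    return (rect[1] - rect[0]) * (rect[3] - rect[2])


full = sample_examples(400, seed=1)
half = full[:200]  # the same data with half of the examples removed
area_full = area(induce_rectangle(full))
area_half = area(induce_rectangle(half))
```

Because the smaller training set is a subset of the larger one, the rectangle induced from `half` can never cover more of the true concept than the one induced from `full`, so `area_half <= area_full <= area(TRUE_RECT)`. With very few positive examples the induced rectangle may shrink to almost nothing.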
