algorithm's responsibility to utilize this information appropriately; if the algo-
rithm cannot do this, then there is an algorithm-level issue. Fortunately, over
the past decade, most classification algorithms have increased in sophistication
so that they can handle evaluation criteria beyond accuracy, such as class-based
misclassification costs and even costs that vary per example.
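The difference between accuracy and cost-based evaluation can be illustrated with a minimal sketch. The labels, the cost values, and the function names below are hypothetical and chosen only for illustration; they are not taken from any particular learning system.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def total_cost(y_true, y_pred, cost, example_weights=None):
    """Sum of misclassification costs.

    cost[(true_label, predicted_label)] is the class-based cost;
    example_weights optionally scales the cost of each individual example.
    """
    if example_weights is None:
        example_weights = [1.0] * len(y_true)
    return sum(w * cost[(t, p)]
               for t, p, w in zip(y_true, y_pred, example_weights))


# Hypothetical data: 1 = rare positive class, 0 = common negative class.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10                             # misses both positives
three_false_alarms = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]    # finds both positives

# Assumed costs: missing a positive is ten times worse than a false alarm.
COST = {(0, 0): 0.0, (1, 1): 0.0, (0, 1): 1.0, (1, 0): 10.0}

for name, pred in [("always negative", always_negative),
                   ("three false alarms", three_false_alarms)]:
    print(name, accuracy(y_true, pred), total_cost(y_true, pred, COST))
```

Under these assumed costs, the classifier that always predicts the negative class has the higher accuracy but also the higher total cost, which is the kind of disagreement that evaluation criteria beyond accuracy are meant to expose. Passing example_weights covers the case in which costs vary per example.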
The problem definition issue also extends to unsupervised learning problems.
Association rule mining systems lack good ways to evaluate the value of an
association rule. Unlike classification, however, association rule mining
generates no single quantitative measure of quality, so this issue is probably
better understood and acknowledged. Association rules are usually tagged with
support and confidence values, but many rules with high support, high
confidence, or even both will be uninteresting and potentially of little value.
The lift of an association rule, which measures how much more likely the
antecedent and consequent are to occur together than they would be if they were
statistically independent, is a somewhat more useful measurement, but it still
does not consider the context in which the association will be used. As with
classification tasks, imbalanced data
causes further problems for the metrics most commonly used for association rule
mining. As mentioned earlier, association rules that involve rare items are not
likely to be generated, even if the rare items, when they do occur, often occur
together (e.g., a cooking pan and a spatula in supermarket sales). This is a
problem because rare items generally carry higher profit margins, so
associations between them are more likely to be profitable.
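The three measures discussed above, and the way a minimum-support threshold hides associations between rare items, can be sketched as follows. The basket counts and the threshold value are invented for illustration, and the function names are hypothetical rather than those of any specific mining library.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Observed co-occurrence divided by that expected under independence."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / (
        support(transactions, antecedent) * support(transactions, consequent))


# Hypothetical supermarket data: 100 baskets, two of which hold the rare pair.
baskets = ([{"bread", "milk"}] * 50 + [{"bread"}] * 20 + [{"milk"}] * 20
           + [{"eggs"}] * 8 + [{"pan", "spatula"}] * 2)

MIN_SUPPORT = 0.05  # assumed threshold, for illustration only

for a, c in [({"bread"}, {"milk"}), ({"pan"}, {"spatula"})]:
    s = support(baskets, a | c)
    print(sorted(a), "->", sorted(c),
          "support:", s,
          "confidence:", confidence(baskets, a, c),
          "lift:", lift(baskets, a, c),
          "kept:", s >= MIN_SUPPORT)
```

With these made-up counts, the pan and spatula rule has a confidence of 1 and a lift of about 50, yet its support of 0.02 falls below the assumed threshold of 0.05, so a support-based miner would never report it. The bread and milk rule passes the threshold easily even though its lift is close to 1.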
2.3.2 Data-Level Issues
The most fundamental data-level issue is the lack of training data that often
accompanies imbalanced data, which was previously referred to as an issue
of absolute rarity. Absolute rarity is not limited to imbalanced data, but it
is very often present when the degree of imbalance is extreme, such as a class
ratio of one to one million. In these cases, the number
of examples associated with the rare class, or rare case, is small in an absolute
sense. There is no predetermined threshold for absolute rarity; any such
threshold would have to be domain specific and would be determined
based on factors such as the dimensionality of the instance space, the distribution
of the feature values within this instance space, and, for classification tasks, the
complexity of the concept to be learned.
Figure 2.5 visually demonstrates the problems that can result from an “abso-
lute” lack of data. The figure shows a simple concept, identified by the solid rect-
angle; examples within this rectangle belong to the positive class and examples
outside this rectangle belong to the negative class. The decision boundary induced
by the classifier from the labeled training data is indicated by the dashed rectangle.
Figures 2.5a and 2.5b show the same concept, but Figure 2.5b has approximately
half as many training examples as Figure 2.5a. As one would expect,
we see that the induced classifier more closely approximates the true decision
boundary in Figure 2.5a because of the availability of additional training data.
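The effect described for Figure 2.5 can be reproduced with a small simulation. This is only a sketch under stated assumptions: the source does not say which classifier produced the dashed rectangle, so the learner here is assumed to be the tightest axis-aligned box around the positive training examples, and the rectangle coordinates, sample sizes, and seed are hypothetical.

```python
import random

# True concept (hypothetical coordinates): positives lie inside this rectangle.
TRUE_RECT = (0.3, 0.3, 0.7, 0.7)  # (x_min, y_min, x_max, y_max)


def inside(rect, point):
    """True if the point falls within the axis-aligned rectangle."""
    x_min, y_min, x_max, y_max = rect
    x, y = point
    return x_min <= x <= x_max and y_min <= y <= y_max


def sample(n, seed=0):
    """Draw n points uniformly from the unit square and label them."""
    rng = random.Random(seed)
    points = [(rng.random(), rng.random()) for _ in range(n)]
    return [(p, inside(TRUE_RECT, p)) for p in points]


def fit_rectangle(examples):
    """Tightest axis-aligned box around the positives (None if there are none)."""
    positives = [p for p, label in examples if label]
    if not positives:
        return None
    xs, ys = zip(*positives)
    return (min(xs), min(ys), max(xs), max(ys))


def area(rect):
    if rect is None:
        return 0.0
    x_min, y_min, x_max, y_max = rect
    return (x_max - x_min) * (y_max - y_min)


def coverage(learned):
    """Fraction of the true concept's area covered by the learned box."""
    return area(learned) / area(TRUE_RECT)


full = sample(200)   # plays the role of Figure 2.5a
half = full[:100]    # half as many examples, as in Figure 2.5b
cov_full = coverage(fit_rectangle(full))
cov_half = coverage(fit_rectangle(half))
print("coverage with all examples:", cov_full)
print("coverage with half the examples:", cov_half)
```

Because the smaller training set is a subset of the larger one, the box learned from it is contained in the box learned from the full set, so its coverage of the true concept can never be higher. With this learner, the shortfall depends only on how many positive examples happen to be available, which is the sense in which the lack of data is "absolute".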