associated items in the distribution [14]. Specifically, the user can specify a
different minsup value for each item. The minimum support for an association rule
is then the lowest minsup value among the items in the rule. Association rule
mining systems are tractable mainly because of the downward closure property
of support: if a set of items satisfies minsup, then so do all of its subsets. While
this downward closure property does not hold with multiple minimum levels of
support, the standard Apriori algorithm for association rule mining can be modified
to satisfy the sorted closure property for multiple minimum levels of support
[14]. The use of multiple minimum levels of support then becomes tractable.
Empirical results indicate that the new algorithm is able to find meaningful
associations involving rare items without producing a huge number of meaningless
rules involving common items.
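
The effect of per-item minimum support can be illustrated with a small sketch. The transactions, item names, and minsup values below are hypothetical and chosen only for illustration; this is not the algorithm of [14], just a direct check of the rule that an itemset's minimum support is the lowest minsup value among its items. It also shows why downward closure fails in this setting: an itemset can satisfy its threshold while one of its subsets does not.

```python
# Hypothetical market-basket data: ten transactions (illustrative only).
transactions = [
    {"bread", "milk"},
    {"bread", "milk"},
    {"bread", "milk", "champagne", "caviar"},
    {"bread"},
    {"milk", "champagne", "caviar"},
    {"bread", "milk"},
    {"bread", "champagne"},
    {"milk"},
    {"bread", "milk"},
    {"bread"},
]

# Hypothetical per-item minimum supports: the rare item gets a low threshold.
minsup = {"bread": 0.5, "milk": 0.5, "champagne": 0.5, "caviar": 0.1}


def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    items = set(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


def itemset_minsup(itemset):
    """Minimum support of an itemset: the lowest minsup among its items."""
    return min(minsup[item] for item in itemset)


def is_frequent(itemset):
    """True if the itemset meets its own (item-dependent) minimum support."""
    return support(itemset) >= itemset_minsup(itemset)


# {caviar, champagne} has support 0.2 and threshold min(0.1, 0.5) = 0.1.
print(is_frequent({"caviar", "champagne"}))  # True
# Its subset {champagne} has support 0.3 but threshold 0.5.
print(is_frequent({"champagne"}))            # False
```

Because the superset is frequent while one of its subsets is not, candidate pruning based on ordinary downward closure would be unsound here, which is why a different closure property is needed.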
2.5 MAPPING FOUNDATIONAL ISSUES TO SOLUTIONS
This section summarizes the foundational problems with imbalanced data
described in Section 2.3 and how they can be addressed by the various methods
described in Section 2.4. This section is organized using the three basic
categories identified earlier in this chapter: problem definition level, data level,
and algorithm level.
The problem-definition-level issues arise because researchers and practitioners
often do not have all of the necessary information about a problem to solve it
optimally. Most frequently this involves not possessing the necessary metrics
to accurately assess the utility of the mined knowledge. The solution to this
problem is simple, although often not achievable: obtain the requisite knowledge
and from this generate the metrics necessary to properly evaluate the mined
knowledge. Because this is often not possible, one must take the next best course
of action: use the best available metric, or one that is at least “robust” in the
sense that it will lead to good, albeit suboptimal, solutions given incomplete
knowledge and hence inexact assumptions. In dealing with imbalanced data, this often means
using ROC analysis when the necessary evaluation information is missing. One
alternative solution that was briefly discussed involves redefining the problem as
a simpler problem for which more exact evaluation information is available.
Fortunately, the state of the art in data-mining technology has advanced to the
point where, in most cases, if we do have the precise evaluation information,
we can utilize it; in the past, data-mining algorithms were often not sufficiently
sophisticated to incorporate such knowledge.
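
As a rough illustration of what ROC analysis involves, the following sketch sweeps the decision threshold over a classifier's scores and computes the area under the resulting curve. The labels and scores are hypothetical, and the helper names roc_points and auc are introduced here for illustration only; the text does not prescribe a particular procedure. The point of the sketch is that the curve depends only on how the examples are ranked, not on a specific cost ratio or class distribution.

```python
def roc_points(labels, scores):
    """Return ROC points (false positive rate, true positive rate).

    labels: 1 for the positive (rare) class, 0 for the negative class.
    scores: classifier scores, where higher means "more likely positive".
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("ROC analysis needs at least one example of each class")

    # Rank examples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Examples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical imbalanced sample: 3 positives and 5 negatives.
labels = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

curve = roc_points(labels, scores)
print(auc(curve))  # 14/15, about 0.933
```

The area equals the fraction of positive-negative pairs in which the positive example receives the higher score (14 of 15 pairs in this sample), which is why it can be used when the precise misclassification costs are unknown.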
Data-level issues also arise when learning from imbalanced data. These issues
mainly relate to absolute rarity. Absolute rarity occurs when one or more classes
do not have sufficient numbers of examples to adequately learn the decision
boundaries associated with that class. Absolute rarity has a much bigger impact
on the rare classes than on common classes. Absolute rarity also applies to rare
cases, which may occur for either rare classes or common classes, but are
disproportionately associated with rare classes. The ideal and most straightforward