[Figure: plot with x-axis "Disjunct size" (ticks 3 to 33 in steps of 3) and an unlabeled y-axis (ticks 0.00 to 0.35).]
Figure 2.4 Impact of disjunct size on classifier performance (move dataset).
2.2.3 Imbalanced Data for Unsupervised Learning Tasks
Virtually all work that explicitly addresses imbalanced data does so in the
context of classification. While classification is a key supervised learning
task, imbalanced data can affect unsupervised learning tasks as well, such as
clustering and association rule mining. There has been very little work on the
effect of imbalanced data on clustering, largely because it is difficult to
quantify “imbalance” in such cases (in many ways, this parallels the issues
with identifying rare cases). But certainly if there are meaningful clusters
containing relatively few examples, existing clustering methods will have
trouble identifying them. There has been more work in the area of association
rule mining, especially with regard to market basket analysis, which looks at
how the items purchased by a customer are related. Some groupings of items,
such as peanut butter and jelly, occur frequently and can be considered common
cases. Other associations may be extremely rare, but represent highly
profitable sales. For example, cooking pan and spatula will be an extremely
rare association in a supermarket, not because the items are unlikely to be
purchased together, but because neither item is frequently purchased in a
supermarket [14]. Association rule mining algorithms should ideally be able to
identify such associations.
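The rare-association problem can be made concrete with a small sketch. The
code below is only an illustration under stated assumptions: the transactions
are invented toy data, the helper names (support, confidence, lift,
frequent_pairs) are hypothetical, and the single minimum-support threshold is
the usual Apriori-style filter rather than a method described in this text. It
shows how a pair such as cooking pan and spatula can have perfect confidence
and high lift, yet be discarded because its support is low.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions (invented for illustration, 100 in total).
transactions = (
    [{"peanut butter", "jelly", "bread"}] * 40
    + [{"peanut butter", "jelly"}] * 20
    + [{"bread", "milk"}] * 30
    + [{"milk"}] * 8
    + [{"cooking pan", "spatula"}] * 2
)


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence divided by the baseline support of the consequent."""
    return confidence(antecedent, consequent, transactions) / support(
        consequent, transactions
    )


def frequent_pairs(transactions, min_support):
    """Item pairs whose support meets a single global threshold (assumed filter)."""
    counts = Counter()
    for t in transactions:
        counts.update(combinations(sorted(t), 2))
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


if __name__ == "__main__":
    for a, b in [("peanut butter", "jelly"), ("cooking pan", "spatula")]:
        print(
            f"{a} -> {b}: "
            f"support={support({a, b}, transactions):.2f}, "
            f"confidence={confidence({a}, {b}, transactions):.2f}, "
            f"lift={lift({a}, {b}, transactions):.2f}"
        )
    # With a 5% threshold the rare pair is filtered out before any rule is formed.
    print(sorted(frequent_pairs(transactions, min_support=0.05)))
```

In this toy data, lowering min_support far enough to keep the rare pair would
also admit every other low-frequency pair, which is one way to see why a single
global threshold handles rare but valuable associations poorly.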
2.3 FOUNDATIONAL ISSUES
Now that we have established the necessary background and terminology, and
demonstrated some of the problems associated with class imbalance, we are
ready to identify and discuss the specific issues and problems associated with
learning from imbalanced data. These issues can be divided into three major