different values on some subtly different data sets. Let's examine data sets D5 and D6,
shown earlier in Table 6.9, where the two events m and c have unbalanced conditional
probabilities. That is, the ratio of mc to c is greater than 0.9. This means that knowing
that c occurs should strongly suggest that m occurs also. The ratio of mc to m is less than
0.1, indicating that m implies that c is quite unlikely to occur. The all confidence and
cosine measures view both cases as negatively associated and the Kulc measure views
both as neutral. The max confidence measure claims strong positive associations for
these cases. The measures give very diverse results!
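
To make the disagreement concrete, the following minimal sketch recomputes the four null-invariant measures from the transaction counts quoted in this section for D5 and D6 (1000 transactions with both milk and coffee; 10,000 and 100,000 with milk only; 100 and 10 with coffee only). The function and parameter names are illustrative assumptions, and each measure is written in terms of the two conditional probabilities P(c|m) and P(m|c), which is how these measures are usually defined.

```python
from math import sqrt


def null_invariant_measures(mc, m_only, c_only):
    """Return four null-invariant measures for items m and c.

    mc      -- transactions containing both m and c
    m_only  -- transactions containing m but not c
    c_only  -- transactions containing c but not m
    Null-transactions (neither m nor c) are not needed at all.
    """
    sup_m = mc + m_only          # support count of m
    sup_c = mc + c_only          # support count of c
    p_c_given_m = mc / sup_m     # ratio of mc to m
    p_m_given_c = mc / sup_c     # ratio of mc to c
    return {
        "all_confidence": min(p_c_given_m, p_m_given_c),
        "max_confidence": max(p_c_given_m, p_m_given_c),
        "kulc": 0.5 * (p_c_given_m + p_m_given_c),
        "cosine": sqrt(p_c_given_m * p_m_given_c),
    }


# Counts as quoted in this section for D5 and D6.
d5 = null_invariant_measures(mc=1000, m_only=10_000, c_only=100)
d6 = null_invariant_measures(mc=1000, m_only=100_000, c_only=10)

for name, result in (("D5", d5), ("D6", d6)):
    print(name, {k: round(v, 2) for k, v in result.items()})
```

For both data sets Kulc comes out at 0.5 (neutral), all confidence and cosine fall well below 0.5, and max confidence exceeds 0.9, which is the spread of verdicts described above.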
“Which measure intuitively reflects the true relationship between the purchase of milk
and coffee?” Due to the “balanced” skewness of the data, it is difficult to argue whether
the two data sets have positive or negative association. From one point of view, only
$mc/(mc + m\bar{c}) = 1000/(1000 + 10{,}000) = 9.09\%$ of milk-related transactions contain
coffee in D5, and this percentage is $1000/(1000 + 100{,}000) = 0.99\%$ in D6, both
indicating a negative association. On the other hand, 90.9% of transactions in D5 (i.e.,
$mc/(mc + \bar{m}c) = 1000/(1000 + 100)$) and 99.0% in D6 (i.e., $1000/(1000 + 10)$)
containing coffee contain milk as well, which indicates a positive association between
milk and coffee. These two views lead to very different conclusions.
For such “balanced” skewness, it could be fair to treat it as neutral, as Kulc does,
and in the meantime indicate its skewness using the imbalance ratio (IR). According to
Eq. (6.13), for D4 we have $IR(m, c) = 0$, a perfectly balanced case; for D5,
$IR(m, c) = 0.89$, a rather imbalanced case; whereas for D6, $IR(m, c) = 0.99$, a very
skewed case. Therefore, the two measures, Kulc and IR, work together, presenting a clear
picture for all three data sets, D4 through D6.
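
The IR values for D5 and D6 can be checked with a short sketch. It assumes Eq. (6.13) has the usual form IR(A, B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A ∪ B)) and reuses the counts quoted in this section. The helper names are hypothetical. D4 is left out because its counts appear only in Table 6.9.

```python
def imbalance_ratio(mc, m_only, c_only):
    """Imbalance ratio of items m and c (assumed form of Eq. 6.13)."""
    sup_m = mc + m_only   # support count of m
    sup_c = mc + c_only   # support count of c
    # The denominator counts transactions containing m or c (or both).
    return abs(sup_m - sup_c) / (sup_m + sup_c - mc)


def kulczynski(mc, m_only, c_only):
    """Kulc: the average of the two conditional probabilities."""
    return 0.5 * (mc / (mc + m_only) + mc / (mc + c_only))


# Counts as quoted in this section: both, milk only, coffee only.
datasets = {
    "D5": dict(mc=1000, m_only=10_000, c_only=100),
    "D6": dict(mc=1000, m_only=100_000, c_only=10),
}

ir_d5 = imbalance_ratio(**datasets["D5"])   # approximately 0.89
ir_d6 = imbalance_ratio(**datasets["D6"])   # approximately 0.99

for name, counts in datasets.items():
    print(name,
          "Kulc =", round(kulczynski(**counts), 2),
          "IR =", round(imbalance_ratio(**counts), 2))
```

Reporting the pair (Kulc, IR) keeps the neutral verdict of Kulc while still flagging how lopsided the two supports are.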
In summary, the use of only support and confidence measures to mine associa-
tions may generate a large number of rules, many of which can be uninteresting to
users. Instead, we can augment the support-confidence framework with a pattern inter-
estingness measure, which helps focus the mining toward rules with strong pattern
relationships. The added measure substantially reduces the number of rules gener-
ated and leads to the discovery of more meaningful rules. Besides those introduced in
this section, many other interestingness measures have been studied in the literature.
Unfortunately, most of them do not have the null-invariance property. Because large
data sets typically have many null-transactions, it is important to consider the null-
invariance property when selecting appropriate interestingness measures for pattern
evaluation. Among the four null-invariant measures studied here, namely all confidence,
max confidence, Kulc, and cosine, we recommend using Kulc in conjunction with the
imbalance ratio.
6.4 Summary
The discovery of frequent patterns, associations, and correlation relationships among
huge amounts of data is useful in selective marketing, decision analysis, and business
management. A popular area of application is market basket analysis , which studies