Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods - Data Mining: Concepts and Techniques


different values on some subtly different data sets. Let's examine data sets D5 and D6, shown earlier in Table 6.9, where the two events m and c have unbalanced conditional probabilities. That is, the ratio of mc to c is greater than 0.9. This means that knowing that c occurs should strongly suggest that m occurs also. The ratio of mc to m is less than 0.1, indicating that m implies that c is quite unlikely to occur. The all_confidence and cosine measures view both cases as negatively associated and the Kulc measure views both as neutral. The max_confidence measure claims strong positive associations for these cases. The measures give very diverse results!
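The disagreement among the four measures can be checked directly from the contingency counts. The sketch below is a minimal illustration, not the book's code. It assumes the D5 and D6 counts used in this discussion (mc = 1000 in both, with 10,000 and 100,000 milk-only transactions and 100 and 10 coffee-only transactions, respectively) and the usual definitions of the measures in terms of the two conditional probabilities P(c|m) and P(m|c). The function name `null_invariant_measures` and the tuple layout are illustrative choices.

```python
from math import sqrt


def null_invariant_measures(mc, m_only, c_only):
    """Return the four null-invariant measures for items m and c.

    mc     -- transactions containing both m and c
    m_only -- transactions containing m but not c
    c_only -- transactions containing c but not m
    Null-transactions (neither m nor c) are not needed.
    """
    sup_m = mc + m_only           # support count of m
    sup_c = mc + c_only           # support count of c
    p_c_given_m = mc / sup_m      # P(c | m)
    p_m_given_c = mc / sup_c      # P(m | c)
    return {
        "all_confidence": min(p_c_given_m, p_m_given_c),
        "max_confidence": max(p_c_given_m, p_m_given_c),
        "kulc": 0.5 * (p_c_given_m + p_m_given_c),
        "cosine": sqrt(p_c_given_m * p_m_given_c),
    }


# (mc, milk-only, coffee-only) counts assumed for D5 and D6.
datasets = {"D5": (1000, 10_000, 100), "D6": (1000, 100_000, 10)}

for name, counts in datasets.items():
    scores = null_invariant_measures(*counts)
    print(name, {k: round(v, 2) for k, v in scores.items()})
```

Under these assumed counts, Kulc comes out at 0.5 for both data sets, max_confidence is above 0.9, and all_confidence is below 0.1, which matches the split described above.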

“Which measure intuitively reflects the true relationship between the purchase of milk and coffee?” Due to the “balanced” skewness of the data, it is difficult to argue whether the two data sets have positive or negative association. From one point of view, only $mc/(mc + m\overline{c}) = 1000/(1000 + 10{,}000) = 9.09\%$ of milk-related transactions contain coffee in D5, and this percentage is $1000/(1000 + 100{,}000) = 0.99\%$ in D6, both indicating a negative association. On the other hand, 90.9% of transactions in D5 (i.e., $mc/(mc + \overline{m}c) = 1000/(1000 + 100)$) and 99.0% in D6 (i.e., $1000/(1000 + 10)$) containing coffee contain milk as well, which indicates a positive association between milk and coffee. The two viewpoints lead to very different conclusions.

For such “balanced” skewness, it could be fair to treat it as neutral, as Kulc does, and in the meantime indicate its skewness using the imbalance ratio (IR). According to Eq. (6.13), for D4 we have $IR(m, c) = 0$, a perfectly balanced case; for D5, $IR(m, c) = 0.89$, a rather imbalanced case; whereas for D6, $IR(m, c) = 0.99$, a very skewed case. Therefore, the two measures, Kulc and IR, work together, presenting a clear picture for all three data sets, D4 through D6.
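The imbalance ratio values for D4 through D6 can be reproduced with a short calculation. The following is a sketch under stated assumptions: it takes Eq. (6.13) to be IR(A, B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A ∪ B)), and it assumes that D4 is the balanced case with 1000 transactions in each of the mc, milk-only, and coffee-only cells (Table 6.9 is not reproduced in this excerpt). The function names and the tuple layout are illustrative.

```python
def imbalance_ratio(mc, m_only, c_only):
    """IR(m, c) = |sup(m) - sup(c)| / (sup(m) + sup(c) - sup(m and c))."""
    sup_m = mc + m_only
    sup_c = mc + c_only
    return abs(sup_m - sup_c) / (sup_m + sup_c - mc)


def kulc(mc, m_only, c_only):
    """Kulczynski measure: the average of P(c | m) and P(m | c)."""
    return 0.5 * (mc / (mc + m_only) + mc / (mc + c_only))


# (mc, milk-only, coffee-only) counts.
datasets = {
    "D4": (1000, 1000, 1000),    # assumed balanced counts
    "D5": (1000, 10_000, 100),
    "D6": (1000, 100_000, 10),
}

for name, counts in datasets.items():
    print(f"{name}: Kulc = {kulc(*counts):.2f}, IR = {imbalance_ratio(*counts):.2f}")
```

Reporting the pair (Kulc, IR) rather than Kulc alone is what separates the three data sets here: Kulc is 0.5 in every case, and only IR shows how lopsided the two conditional probabilities are.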

In summary, the use of only support and confidence measures to mine associations may generate a large number of rules, many of which can be uninteresting to users. Instead, we can augment the support-confidence framework with a pattern interestingness measure, which helps focus the mining toward rules with strong pattern relationships. The added measure substantially reduces the number of rules generated and leads to the discovery of more meaningful rules. Besides those introduced in this section, many other interestingness measures have been studied in the literature. Unfortunately, most of them do not have the null-invariance property. Because large data sets typically have many null-transactions, it is important to consider the null-invariance property when selecting appropriate interestingness measures for pattern evaluation. Among the four null-invariant measures studied here, namely all_confidence, max_confidence, Kulc, and cosine, we recommend using Kulc in conjunction with the imbalance ratio.
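To make the null-invariance property concrete, the sketch below varies only the number of null-transactions and compares Kulc with lift. Lift is used here purely as a familiar example of a measure that is not null-invariant; it is not one of the four measures recommended above. The counts (1000 transactions with both items, 10,000 milk-only, 100 coffee-only) follow the D5 example, and the two null-transaction counts, 0 and 100,000, are hypothetical values chosen for illustration.

```python
def kulc(mc, m_only, c_only, null_count):
    """Kulczynski measure; null_count is accepted but never used."""
    return 0.5 * (mc / (mc + m_only) + mc / (mc + c_only))


def lift(mc, m_only, c_only, null_count):
    """Lift = P(m and c) / (P(m) * P(c)); depends on the total count."""
    total = mc + m_only + c_only + null_count
    p_mc = mc / total
    p_m = (mc + m_only) / total
    p_c = (mc + c_only) / total
    return p_mc / (p_m * p_c)


# Same item counts, different numbers of null-transactions.
for null_count in (0, 100_000):
    args = (1000, 10_000, 100, null_count)
    print(f"null = {null_count}: Kulc = {kulc(*args):.2f}, lift = {lift(*args):.2f}")
```

With these numbers, lift moves from below 1 to well above 1 as null-transactions are added, while Kulc does not change, which is the behavior the null-invariance property describes.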

6.4 Summary

The discovery of frequent patterns, associations, and correlation relationships among

huge amounts of data is useful in selective marketing, decision analysis, and business

management. A popular area of application is market basket analysis , which studies
