Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods - Data Mining: Concepts and Techniques

Databases Reference

In-Depth Information

The computation is done by intersection of the TID sets of the frequent k -itemsets to

compute the TID sets of the corresponding

-itemsets. This process repeats, with

k incremented by 1 each time, until no frequent itemsets or candidate itemsets can be

found.

Besides taking advantage of the Apriori property in the generation of candidate

.

k C1

/

.

-itemset from frequent k -itemsets, another merit of this method is that there

is no need to scan the database to find the support of

k C1

/

-itemsets (for k 1).

This is because the TID set of each k -itemset carries the complete information required

for counting such support. However, the TID sets can be quite long, taking substantial

memory space as well as computation time for intersecting the long sets.

To further reduce the cost of registering long TID sets, as well as the subsequent

costs of intersections, we can use a technique called diffset , which keeps track of only

the differences of the TID sets of a

.

k C1

/

-itemset and a corresponding k -itemset. For

instance, in Example 6.6 we have fI1gDfT100, T400, T500, T700, T800, T900g and fI1,

I2gDfT100, T400, T800, T900g. The diffset between the two is diffset (fI1, I2g, fI1g) D

fT500, T700g. Thus, rather than recording the four TIDs that make up the intersection of

fI1gandfI2g, we can instead use diffset to record just two TIDs, indicating the difference

betweenfI1gandfI1, I2g. Experiments show that in certain situations, such as when the

data set contains many dense and long patterns, this technique can substantially reduce

the total cost of vertical format mining of frequent itemsets.

.

k C1

/

6.2.6 Mining Closed and Max Patterns

In Section 6.1.2 we saw how frequent itemset mining may generate a huge number of

frequent itemsets, especially when the min sup threshold is set low or when there exist

long patterns in the data set. Example 6.2 showed that closed frequent itemsets 9 can

substantially reduce the number of patterns generated in frequent itemset mining while

preserving the complete information regarding the set of frequent itemsets. That is, from

the set of closed frequent itemsets, we can easily derive the set of frequent itemsets and

their support. Thus, in practice, it is more desirable to mine the set of closed frequent

itemsets rather than the set of all frequent itemsets in most cases.

“How can we mine closed frequent itemsets?” A naïve approach would be to first mine

the complete set of frequent itemsets and then remove every frequent itemset that is a

proper subset of, and carries the same support as, an existing frequent itemset. However,

this is quite costly. As shown in Example 6.2, this method would have to first derive

2 100 1 frequent itemsets to obtain a length-100 frequent itemset, all before it could

begin to eliminate redundant itemsets. This is prohibitively expensive. In fact, there exist

only a very small number of closed frequent itemsets in Example 6.2's data set.

A recommended methodology is to search for closed frequent itemsets directly dur-

ing the mining process. This requires us to prune the search space as soon as we

9 Remember that X is a closed frequent itemset in a data set S if there exists no proper super-itemset Y

such that Y has the same support count as X in S , and X satisfies minimum support.

Data Mining: Concepts and Techniques

Search WWH ::

Custom Search

Home