Frequent Itemsets - Mining of Massive Datasets

Database Reference

In-Depth Information

The set C 2 of candidate pairs is the set of pairs both of whose items are in L 1 ; that is,

they are frequent items. Note that we do not construct C 2 explicitly.

Rather we use the definition of C 2 , and we test membership in C 2 by testing whether both

of its members are in L 1 . The second pass of the A-Priori Algorithm counts all the candid-

ate pairs and determines which appear at least s times. These pairs form L 2 , the frequent

pairs.

We can follow this pattern as far as we wish. The set C 3 of candidate triples is construc-

ted (implicitly) as the set of triples, any two of which is a pair in L 2 . Our assumption about

the sparsity of frequent itemsets, outlined in Section 6.2.4 implies that there will not be too

many frequent pairs, so they can be listed in a main-memory table. Likewise, there will not

be too many candidate triples, so these can all be counted by a generalization of the triples

method. That is, while triples are used to count pairs, we would use quadruples, consisting

of the three item codes and the associated count, when we want to count triples. Similarly,

we can count sets of size k using tuples with k + 1 components, the last of which is the

count, and the first k of which are the item codes, in sorted order.

To find L 3 we make a third pass through the basket file. For each basket, we need only

look at those items that are in L 1 . From these items, we can examine each pair and determ-

ine whether or not that pair is in L 2 . Any item of the basket that does not appear in at least

two frequent pairs, both of which consist of items in the basket, cannot be part of a frequent

triple that the basket contains. Thus, we have a fairly limited search for triples that are both

contained in the basket and are candidates in C 3 . Any such triples found have 1 added to

their count.

EXAMPLE 6.8 Suppose our basket consists of items 1 through 10. Of these, 1 through 5

have been found to be frequent items, and the following pairs have been found frequent:

{1 , 2}, {2 , 3}, {3 , 4}, and {4 , 5}. At first, we eliminate the nonfrequent items, leaving only

1 through 5. However, 1 and 5 appear in only one frequent pair in the itemset, and there-

fore cannot contribute to a frequent triple contained in the basket. Thus, we must consider

adding to the count of triples that are contained in {2 , 3 , 4}. There is only one such triple, of

course. However, we shall not find it in C 3 , because {2 , 4} evidently is not frequent.

□

The construction of the collections of larger frequent itemsets and candidates proceeds

in essentially the same manner, until at some pass we find no new frequent itemsets and

stop. That is:

(1) Define C k to be all those itemsets of size k , every k −1 of which is an itemset in L k −1 .

(2) Find L k by making a pass through the baskets and counting all and only the itemsets of

size k that are in C k . Those itemsets that have count at least s are in L k .

Search WWH ::

Custom Search

Home