Frequent Itemsets - Mining of Massive Datasets

Databases Reference

In-Depth Information

3. For each frequent pair, add one to its count in the data structure used to

store counts.

Finally, at the end of the second pass, examine the structure of counts to

determine which pairs are frequent.

6.2.6 A-Priori for All Frequent Itemsets

The same technique used for finding frequent pairs without counting all pairs

lets us find larger frequent itemsets without an exhaustive count of all sets. In

the A-Priori Algorithm, one pass is taken for each set-size k. If no frequent

itemsets of a certain size are found, then monotonicity tells us there can be no

larger frequent itemsets, so we can stop.

The pattern of moving from one size k to the next size k + 1 can be sum-

marized as follows. For each size k, there are two sets of itemsets:

1. C k is the set of candidate itemsets of size k - the itemsets that we must

count in order to determine whether they are in fact frequent.

2. L k is the set of truly frequent itemsets of size k.

The pattern of moving from one set to the next and one size to the next is

suggested by Fig. 6.4.

C

L

C

L

C

L

1

2

3

. . .

Filter

Construct

Filter

Construct

Filter

Construct

Pairs of

frequent

items

All

items

Frequent

items

pairs

Figure 6.4: The A-Priori Algorithm alternates between constructing candidate

sets and filtering to find those that are truly frequent

We start with C 1 , which is all singleton itemsets, i.e., the items themselves.

That is, before we examine the data, any item could be frequent as far as we

know. The first filter step is to count all items, and those whose counts are at

least the support threshold s form the set L 1 of frequent items.

The set C 2 of candidate pairs is the set of pairs both of whose items are

in L 1 ; that is, they are frequent items. Note that we do not construct C 2

explicitly. Rather we use the definition of C 2 , and we test membership in C 2 by

testing whether both of its members are in L 1 . The second pass of the A-Priori

Algorithm counts all the candidate pairs and determines which appear at least

s times. These pairs form L 2 , the frequent pairs.

We can follow this pattern as far as we wish. The set C 3 of candidate

triples is constructed (implicitly) as the set of triples, any two of which is a

pair in L 2 . Our assumption about the sparsity of frequent itemsets, outlined

Mining of Massive Datasets

Search WWH ::

Custom Search

Home