Database Reference
In-Depth Information
The set C 2 of candidate pairs is the set of pairs both of whose items are in L 1 ; that is,
they are frequent items. Note that we do not construct C 2 explicitly.
Rather we use the definition of C 2 , and we test membership in C 2 by testing whether both
of its members are in L 1 . The second pass of the A-Priori Algorithm counts all the candid-
ate pairs and determines which appear at least s times. These pairs form L 2 , the frequent
pairs.
We can follow this pattern as far as we wish. The set C 3 of candidate triples is construc-
ted (implicitly) as the set of triples, any two of which is a pair in L 2 . Our assumption about
the sparsity of frequent itemsets, outlined in Section 6.2.4 implies that there will not be too
many frequent pairs, so they can be listed in a main-memory table. Likewise, there will not
be too many candidate triples, so these can all be counted by a generalization of the triples
method. That is, while triples are used to count pairs, we would use quadruples, consisting
of the three item codes and the associated count, when we want to count triples. Similarly,
we can count sets of size k using tuples with k + 1 components, the last of which is the
count, and the first k of which are the item codes, in sorted order.
To find L 3 we make a third pass through the basket file. For each basket, we need only
look at those items that are in L 1 . From these items, we can examine each pair and determ-
ine whether or not that pair is in L 2 . Any item of the basket that does not appear in at least
two frequent pairs, both of which consist of items in the basket, cannot be part of a frequent
triple that the basket contains. Thus, we have a fairly limited search for triples that are both
contained in the basket and are candidates in C 3 . Any such triples found have 1 added to
their count.
EXAMPLE 6.8 Suppose our basket consists of items 1 through 10. Of these, 1 through 5
have been found to be frequent items, and the following pairs have been found frequent:
{1 , 2}, {2 , 3}, {3 , 4}, and {4 , 5}. At first, we eliminate the nonfrequent items, leaving only
1 through 5. However, 1 and 5 appear in only one frequent pair in the itemset, and there-
fore cannot contribute to a frequent triple contained in the basket. Thus, we must consider
adding to the count of triples that are contained in {2 , 3 , 4}. There is only one such triple, of
course. However, we shall not find it in C 3 , because {2 , 4} evidently is not frequent.
The construction of the collections of larger frequent itemsets and candidates proceeds
in essentially the same manner, until at some pass we find no new frequent itemsets and
stop. That is:
(1) Define C k to be all those itemsets of size k , every k −1 of which is an itemset in L k −1 .
(2) Find L k by making a pass through the baskets and counting all and only the itemsets of
size k that are in C k . Those itemsets that have count at least s are in L k .
Search WWH ::




Custom Search