Databases Reference
In-Depth Information
in Section 6.2.4 implies that there will not be too many frequent pairs, so they
can be listed in a main-memory table. Likewise, there will not be too many
candidate triples, so these can all be counted by a generalization of the triples
method. That is, while triples are used to count pairs, we would use quadruples,
consisting of the three item codes and the associated count, when we want to
count triples. Similarly, we can count sets of size k using tuples with k + 1
components, the last of which is the count, and the first k of which are the item
codes, in sorted order.
To find L 3 we make a third pass through the basket file. For each basket,
we need only look at those items that are in L 1 . From these items, we can
examine each pair and determine whether or not that pair is in L 2 . Any item
of the basket that does not appear in at least two frequent pairs, both of which
consist of items in the basket, cannot be part of a frequent triple that the
basket contains. Thus, we have a fairly limited search for triples that are both
contained in the basket and are candidates in C 3 . Any such triples found have
1 added to their count.
Example 6.8 : Suppose our basket consists of items 1 through 10. Of these, 1
through 5 have been found to be frequent items, and the following pairs have
been found frequent:{1, 2},{2, 3},{3, 4}, and{4, 5}. At first, we eliminate the
nonfrequent items, leaving only 1 through 5. However, 1 and 5 appear in only
one frequent pair in the itemset, and therefore cannot contribute to a frequent
triple contained in the basket. Thus, we must consider adding to the count of
triples that are contained in{2, 3, 4}. There is only one such triple, of course.
However, we shall not find it in C 3 , because{2, 4}evidently is not frequent.
2
The construction of the collections of larger frequent itemsets and candidates
proceeds in essentially the same manner, until at some pass we find no new
frequent itemsets and stop. That is:
1. Define C k to be all those itemsets of size k, every k−1 of which is an
itemset in L k−1 .
2. Find L k by making a pass through the baskets and counting all and only
the itemsets of size k that are in C k . Those itemsets that have count at
least s are in L k .
6.2.7
Exercises for Section 6.2
Exercise 6.2.1 : If we use a triangular matrix to count pairs, and n, the num-
ber of items, is 20, what pair's count is in a[100]?
! Exercise 6.2.2 : In our description of the triangular-matrix method in Sec-
tion 6.2.2, the formula for k involves dividing an arbitrary integer i by 2. Yet
we need to have k always be an integer. Prove that k will, in fact, be an integer.
Search WWH ::




Custom Search