Databases Reference
In-Depth Information
Between the passes of PCY, the hash table is summarized as a bitmap, with
one bit for each bucket. The bit is 1 if the bucket is frequent and 0 if not. Thus
integers of 32 bits are replaced by single bits, and the bitmap shown in the
second pass in Fig. 6.5 takes up only 1/32 of the space that would otherwise be
available to store counts. However, if most buckets are infrequent, we expect
that the number of pairs being counted on the second pass will be much smaller
than the total number of pairs of frequent items. Thus, PCY can handle some
data sets without thrashing during the second pass, while A-Priori would run
out of main memory and thrash.
There is another subtlety regarding the second pass of PCY that affects
the amount of space needed. While we were able to use the triangular-matrix
method on the second pass of A-Priori if we wished, because the frequent items
could be renumbered from 1 to some m, we cannot do so for PCY. The reason
is that the pairs of frequent items that PCY lets us avoid counting are placed
randomly within the triangular matrix; they are the pairs that happen to hash
to an infrequent bucket on the first pass. There is no known way of compacting
the matrix to avoid leaving space for the uncounted pairs.
Consequently, we are forced to use the triples method in PCY. That re-
striction may not matter if the fraction of pairs of frequent items that actually
appear in buckets were small; we would then want to use triples for A-Priori
anyway. However, if most pairs of frequent items appear together in at least
one bucket, then we are forced in PCY to use triples, while A-Priori can use a
triangular matrix. Thus, unless PCY lets us avoid counting at least 2/3 of the
frequent pairs, we cannot gain by using PCY instead of A-Priori.
While the discovery of frequent pairs by PCY differs significantly from A-
Priori, the later stages, where we find frequent triples and larger sets if desired,
are essentially the same as A-Priori. This statement holds as well for each of
the improvements to A-Priori that we cover in this section. As a result, we
shall cover only the construction of the frequent pairs from here on.
6.3.2
The Multistage Algorithm
The multistage Algorithm improves upon PCY by using several successive hash
tables to reduce further the number of candidate pairs. The tradeoff is that
Multistage takes more than two passes to find the frequent pairs. An outline of
the Multistage Algorithm is shown in Fig. 6.6.
The first pass of Multistage is the same as the first pass of PCY. After that
pass, the frequent buckets are identified and summarized by a bitmap, again
the same as in PCY. But the second pass of Multistage does not count the
candidate pairs. Rather, it uses the available main memory for another hash
table, using another hash function. Since the bitmap from the first hash table
takes up 1/32 of the available main memory, the second hash table has almost
as many buckets as the first.
On the second pass of Multistage, we again go through the file of baskets.
There is no need to count the items again, since we have those counts from
Search WWH ::




Custom Search