Database Reference
In-Depth Information
larger, say 1000, then it must be that the great majority of the buckets are infrequent. The
greatest possible number of frequent buckets is 4 . 5×10 10 /1000, or 45 million out of the 250
million buckets.
Between the passes of PCY, the hash table is summarized as a bitmap , with one bit for
each bucket. The bit is 1 if the bucket is frequent and 0 if not. Thus integers of 32 bits are
replaced by single bits, and the bitmap shown in the second pass in Fig. 6.5 takes up only
1/32 of the space that would otherwise be available to store counts. However, if most buck-
ets are infrequent, we expect that the number of pairs being counted on the second pass will
be much smaller than the total number of pairs of frequent items. Thus, PCY can handle
some data sets without thrashing during the second pass, while A-Priori would run out of
main memory and thrash.
There is another subtlety regarding the second pass of PCY that affects the amount of
space needed. While we were able to use the triangular-matrix method on the second pass
of A-Priori if we wished, because the frequent items could be renumbered from 1 to some
m , we cannot do so for PCY. The reason is that the pairs of frequent items that PCY lets
us avoid counting are placed randomly within the triangular matrix; they are the pairs that
happen to hash to an infrequent bucket on the first pass. There is no known way of com-
pacting the matrix to avoid leaving space for the uncounted pairs.
Consequently, we are forced to use the triples method in PCY. That restriction may not
matter if the fraction of pairs of frequent items that actually appear in buckets were small;
we would then want to use triples for A-Priori anyway. However, if most pairs of frequent
items appear together in at least one bucket, then we are forced in PCY to use triples, while
A-Priori can use a triangular matrix. Thus, unless PCY lets us avoid counting at least 2/3
of the pairs of frequent items, we cannot gain by using PCY instead of A-Priori.
While the discovery of frequent pairs by PCY differs significantly from A-Priori, the
later stages, where we find frequent triples and larger sets if desired, are essentially the
same as A-Priori. This statement holds as well for each of the improvements to A-Priori
that we cover in this section. As a result, we shall cover only the construction of the fre-
quent pairs from here on.
6.3.2
The Multistage Algorithm
The Multistage Algorithm improves upon PCY by using several successive hash tables to
reduce further the number of candidate pairs. The tradeoff is that Multistage takes more
than two passes to find the frequent pairs. An outline of the Multistage Algorithm is shown
in Fig. 6.6 .
Search WWH ::




Custom Search