Database Reference
In-Depth Information
Figure 6.6 The Multistage Algorithm uses additional hash tables to reduce the number of candidate pairs
The first pass of Multistage is the same as the first pass of PCY. After that pass, the fre-
quent buckets are identified and summarized by a bitmap, again the same as in PCY. But
the second pass of Multistage does not count the candidate pairs. Rather, it uses the avail-
able main memory for another hash table, using another hash function. Since the bitmap
from the first hash table takes up 1/32 of the available main memory, the second hash table
has almost as many buckets as the first.
On the second pass of Multistage, we again go through the file of baskets. There is no
need to count the items again, since we have those counts from the first pass. However, we
must retain the information about which items are frequent, since we need it on both the
second and third passes. During the second pass, we hash certain pairs of items to buckets
of the second hash table. A pair is hashed only if it meets the two criteria for being counted
in the second pass of PCY; that is, we hash { i, j } if and only if i and j are both frequent, and
the pair hashed to a frequent bucket on the first pass. As a result, the sum of the counts in
the second hash table should be significantly less than the sum for the first pass. The result
is that, even though the second hash table has only 31/32 of the number of buckets that the
first table has, we expect there to be many fewer frequent buckets in the second hash table
than in the first.
After the second pass, the second hash table is also summarized as a bitmap, and that bit-
map is stored in main memory. The two bitmaps together take up slightly less than 1/16th
of the available main memory, so there is still plenty of space to count the candidate pairs
on the third pass. A pair { i, j } is in C 2 if and only if:
(1) i and j are both frequent items.
(2) { i, j } hashed to a frequent bucket in the first hash table.
(3) { i, j } hashed to a frequent bucket in the second hash table.
The third condition is the distinction between Multistage and PCY.
A Subtle Error in Multistage
Search WWH ::




Custom Search