Databases Reference
In-Depth Information
1
2
1
2
1
2
Item
names
to
integers
Item
names
to
integers
Item
names
to
integers
Fre−
Fre−
Item
quent
quent
counts
items
items
n
n
n
Bitmap 1
Bitmap 1
Bitmap 2
Second
hash table
Hash table
Data strucrure
for counts
of pairs
for bucket
for bucket
counts
counts
Pass 1
Pass 2
Pass 3
Figure 6.6: The Multistage Algorithm uses additional hash tables to reduce the
number of candidate pairs
the first pass. However, we must retain the information about which items are
frequent, since we need it on both the second and third passes. During the
second pass, we hash certain pairs of items to buckets of the second hash table.
A pair is hashed only if it meets the two criteria for being counted in the second
pass of PCY; that is, we hash{i, j}if and only if i and j are both frequent,
and the pair hashed to a frequent bucket on the first pass. As a result, the sum
of the counts in the second hash table should be significantly less than the sum
for the first pass. The result is that, even though the second hash table has
only 31/32 of the number of buckets that the first table has, we expect there
to be many fewer frequent buckets in the second hash table than in the first.
After the second pass, the second hash table is also summarized as a bitmap,
and that bitmap is stored in main memory. The two bitmaps together take up
slightly less than 1/16th of the available main memory, so there is still plenty
of space to count the candidate pairs on the third pass. A pair{i, j}is in C 2 if
and only if:
1. i and j are both frequent items.
2.{i, j}hashed to a frequent bucket in the first hash table.
3.{i, j}hashed to a frequent bucket in the second hash table.
The third condition is the distinction between Multistage and PCY.
It might be obvious that it is possible to insert any number of passes between
the first and last in the multistage Algorithm. There is a limiting factor that
Search WWH ::




Custom Search