Database Reference
In-Depth Information
Occasionally, an implementation tries to eliminate the second requirement for { i, j } to be a candidate - that it hashes
to a frequent bucket on the first pass. The (false) reasoning is that if it didn't hash to a frequent bucket on the first
pass, it wouldn't have been hashed at all on the second pass, and thus would not contribute to the count of its bucket
on the second pass. While it is true that the pair is not counted on the second pass, that doesn't mean it wouldn't have
hashed to a frequent bucket had it been hashed. Thus, it is entirely possible that { i, j } consists of two frequent items
and hashes to a frequent bucket on the second pass, yet it did not hash to a frequent bucket on the first pass. Therefore,
all three conditions must be checked on the counting pass of Multistage.
It might be obvious that it is possible to insert any number of passes between the first
and last in the multistage Algorithm. There is a limiting factor that each pass must store
the bitmaps from each of the previous passes. Eventually, there is not enough space left in
main memory to do the counts. No matter how many passes we use, the truly frequent pairs
will always hash to a frequent bucket, so there is no way to avoid counting them.
6.3.3
The Multihash Algorithm
Sometimes, we can get most of the benefit of the extra passes of the Multistage Algorithm
in a single pass. This variation of PCY is called the Multihash Algorithm . Instead of using
two different hash tables on two successive passes, use two hash functions and two separ-
ate hash tables that share main memory on the first pass, as suggested by Fig. 6.7 .
Figure 6.7 The Multihash Algorithm uses several hash tables in one pass
The danger of using two hash tables on one pass is that each hash table has half as many
buckets as the one large hash table of PCY. As long as the average count of a bucket for
PCY is much lower than the support threshold, we can operate two half-sized hash tables
and still expect most of the buckets of both hash tables to be infrequent. Thus, in this situ-
ation we might well choose the multihash approach.
EXAMPLE 6.10 Suppose that if we run PCY, the average bucket will have a count s /10,
where s is the support threshold. Then if we used the Multihash approach with two half-
sized hash tables, the average count would be s /5. As a result, at most 1/5th of the buckets
Search WWH ::




Custom Search