Database Reference
In-Depth Information
in either hash table could be frequent, and a random infrequent pair has at most probability
(1/5) 2 = 0 . 04 of being in a frequent bucket in both hash tables.
By the same reasoning, the upper bound on the infrequent pair being in a frequent bucket
in the one PCY hash table is at most 1/10. That is, we might expect to have to count 2.5
times as many infrequent pairs in PCY as in the version of Multihash suggested above. We
would therefore expect Multihash to have a smaller memory requirement for the second
pass than would PCY.
But these upper bounds do not tell the complete story. There may be many fewer frequent
buckets than the maximum for either algorithm, since the presence of some very frequent
pairs will skew the distribution of bucket counts. However, this analysis is suggestive of
the possibility that for some data and support thresholds, we can do better by running sev-
eral hash functions in main memory at once.
For the second pass of Multihash, each hash table is converted to a bitmap, as usual.
Note that the two bitmaps for the two hash functions in Fig. 6.7 occupy exactly as much
space as a single bitmap would for the second pass of the PCY Algorithm. The conditions
for a pair { i, j } to be in C 2 , and thus to require a count on the second pass, are the same
as for the third pass of Multistage: i and j must both be frequent, and the pair must have
hashed to a frequent bucket according to both hash tables.
Just as Multistage is not limited to two hash tables, we can divide the available main
memory into as many hash tables as we like on the first pass of Multihash. The risk is that
should we use too many hash tables, the average count for a bucket will exceed the support
threshold. At that point, there may be very few infrequent buckets in any of the hash tables.
Even though a pair must hash to a frequent bucket in every hash table to be counted, we
may find that the probability an infrequent pair will be a candidate rises, rather than falls,
if we add another hash table.
6.3.4
Exercises for Section 6.3
EXERCISE 6.3.1 Here is a collection of twelve baskets. Each contains three of the six items
1 through 6.
{1, 2, 3} {2, 3, 4} {3, 4, 5} {4, 5, 6}
{1, 3, 5} {2, 4, 6} {1, 3, 4} {2, 4, 5}
{3, 5, 6} {1, 2, 4} {2, 3, 5} {3, 4, 6}
Suppose the support threshold is 4. On the first pass of the PCY Algorithm we use a hash
table with 11 buckets, and the set { i, j } is hashed to bucket i × j mod 11.
Search WWH ::




Custom Search