Databases Reference
In-Depth Information
1
2
1
2
Item
names
to
integers
Item
names
to
integers
Fre−
Item
quent
counts
items
n
n
Bitmap 1
Bitmap 2
Hash Table 1
Data strucrure
for counts
of pairs
Hash Table 2
Pass 1
Pass 2
Figure 6.7: The Multihash Algorithm uses several hash tables in one pass
version of Multihash suggested above. We would therefore expect Multihash to
have a smaller memory requirement for the second pass than would PCY.
But these upper bounds do not tell the complete story. There may be
many fewer frequent buckets than the maximum for either algorithm, since the
presence of some very frequent pairs will skew the distribution of bucket counts.
However, this analysis is suggestive of the possibility that for some data and
support thresholds, we can do better by running several hash functions in main
memory at once.
2
For the second pass of Multihash, each hash table is converted to a bitmap,
as usual. Note that the two bitmaps for the two hash functions in Fig. 6.7
occupy exactly as much space as a single bitmap would for the second pass of
the PCY Algorithm. The conditions for a pair{i, j}to be in C 2 , and thus
to require a count on the second pass, are the same as for the third pass of
Multistage: i and j must both be frequent, and the pair must have hashed to
a frequent bucket according to both hash tables.
Just as Multistage is not limited to two hash tables, we can divide the
available main memory into as many hash tables as we like on the first pass of
Multihash. The risk is that should we use too many hash tables, the average
count for a bucket will exceed the support threshold. At that point, there may
be very few infrequent buckets in any of the hash tables. Even though a pair
must hash to a frequent bucket in every hash table to be counted, we may find
that the probability an infrequent pair will be a candidate rises, rather than
falls, if we add another hash table.
Search WWH ::




Custom Search