Frequent Itemsets - Mining of Massive Datasets

Databases Reference

In-Depth Information

1

2

1

2

Item

names

to

integers

Item

names

to

integers

Fre−

Item

quent

counts

items

n

Bitmap 1

Bitmap 2

Hash Table 1

Data strucrure

for counts

of pairs

Hash Table 2

Pass 1

Pass 2

Figure 6.7: The Multihash Algorithm uses several hash tables in one pass

version of Multihash suggested above. We would therefore expect Multihash to

have a smaller memory requirement for the second pass than would PCY.

But these upper bounds do not tell the complete story. There may be

many fewer frequent buckets than the maximum for either algorithm, since the

presence of some very frequent pairs will skew the distribution of bucket counts.

However, this analysis is suggestive of the possibility that for some data and

support thresholds, we can do better by running several hash functions in main

memory at once.

2

For the second pass of Multihash, each hash table is converted to a bitmap,

as usual. Note that the two bitmaps for the two hash functions in Fig. 6.7

occupy exactly as much space as a single bitmap would for the second pass of

the PCY Algorithm. The conditions for a pair{i, j}to be in C 2 , and thus

to require a count on the second pass, are the same as for the third pass of

Multistage: i and j must both be frequent, and the pair must have hashed to

a frequent bucket according to both hash tables.

Just as Multistage is not limited to two hash tables, we can divide the

available main memory into as many hash tables as we like on the first pass of

Multihash. The risk is that should we use too many hash tables, the average

count for a bucket will exceed the support threshold. At that point, there may

be very few infrequent buckets in any of the hash tables. Even though a pair

must hash to a frequent bucket in every hash table to be counted, we may find

that the probability an infrequent pair will be a candidate rises, rather than

falls, if we add another hash table.

Mining of Massive Datasets

Search WWH ::

Custom Search

Home