Frequent Itemsets - Mining of Massive Datasets

Databases Reference

In-Depth Information

Between the passes of PCY, the hash table is summarized as a bitmap, with

one bit for each bucket. The bit is 1 if the bucket is frequent and 0 if not. Thus

integers of 32 bits are replaced by single bits, and the bitmap shown in the

second pass in Fig. 6.5 takes up only 1/32 of the space that would otherwise be

available to store counts. However, if most buckets are infrequent, we expect

that the number of pairs being counted on the second pass will be much smaller

than the total number of pairs of frequent items. Thus, PCY can handle some

data sets without thrashing during the second pass, while A-Priori would run

out of main memory and thrash.

There is another subtlety regarding the second pass of PCY that affects

the amount of space needed. While we were able to use the triangular-matrix

method on the second pass of A-Priori if we wished, because the frequent items

could be renumbered from 1 to some m, we cannot do so for PCY. The reason

is that the pairs of frequent items that PCY lets us avoid counting are placed

randomly within the triangular matrix; they are the pairs that happen to hash

to an infrequent bucket on the first pass. There is no known way of compacting

the matrix to avoid leaving space for the uncounted pairs.

Consequently, we are forced to use the triples method in PCY. That re-

striction may not matter if the fraction of pairs of frequent items that actually

appear in buckets were small; we would then want to use triples for A-Priori

anyway. However, if most pairs of frequent items appear together in at least

one bucket, then we are forced in PCY to use triples, while A-Priori can use a

triangular matrix. Thus, unless PCY lets us avoid counting at least 2/3 of the

frequent pairs, we cannot gain by using PCY instead of A-Priori.

While the discovery of frequent pairs by PCY differs significantly from A-

Priori, the later stages, where we find frequent triples and larger sets if desired,

are essentially the same as A-Priori. This statement holds as well for each of

the improvements to A-Priori that we cover in this section. As a result, we

shall cover only the construction of the frequent pairs from here on.

6.3.2

The Multistage Algorithm

The multistage Algorithm improves upon PCY by using several successive hash

tables to reduce further the number of candidate pairs. The tradeoff is that

Multistage takes more than two passes to find the frequent pairs. An outline of

the Multistage Algorithm is shown in Fig. 6.6.

The first pass of Multistage is the same as the first pass of PCY. After that

pass, the frequent buckets are identified and summarized by a bitmap, again

the same as in PCY. But the second pass of Multistage does not count the

candidate pairs. Rather, it uses the available main memory for another hash

table, using another hash function. Since the bitmap from the first hash table

takes up 1/32 of the available main memory, the second hash table has almost

as many buckets as the first.

On the second pass of Multistage, we again go through the file of baskets.

There is no need to count the items again, since we have those counts from

Mining of Massive Datasets

Search WWH ::

Custom Search

Home