Frequent Itemsets - Mining of Massive Datasets

Database Reference

In-Depth Information

larger, say 1000, then it must be that the great majority of the buckets are infrequent. The

greatest possible number of frequent buckets is 4 . 5×10 10 /1000, or 45 million out of the 250

million buckets.

□

Between the passes of PCY, the hash table is summarized as a bitmap , with one bit for

each bucket. The bit is 1 if the bucket is frequent and 0 if not. Thus integers of 32 bits are

replaced by single bits, and the bitmap shown in the second pass in Fig. 6.5 takes up only

1/32 of the space that would otherwise be available to store counts. However, if most buck-

ets are infrequent, we expect that the number of pairs being counted on the second pass will

be much smaller than the total number of pairs of frequent items. Thus, PCY can handle

some data sets without thrashing during the second pass, while A-Priori would run out of

main memory and thrash.

There is another subtlety regarding the second pass of PCY that affects the amount of

space needed. While we were able to use the triangular-matrix method on the second pass

of A-Priori if we wished, because the frequent items could be renumbered from 1 to some

m , we cannot do so for PCY. The reason is that the pairs of frequent items that PCY lets

us avoid counting are placed randomly within the triangular matrix; they are the pairs that

happen to hash to an infrequent bucket on the first pass. There is no known way of com-

pacting the matrix to avoid leaving space for the uncounted pairs.

Consequently, we are forced to use the triples method in PCY. That restriction may not

matter if the fraction of pairs of frequent items that actually appear in buckets were small;

we would then want to use triples for A-Priori anyway. However, if most pairs of frequent

items appear together in at least one bucket, then we are forced in PCY to use triples, while

A-Priori can use a triangular matrix. Thus, unless PCY lets us avoid counting at least 2/3

of the pairs of frequent items, we cannot gain by using PCY instead of A-Priori.

While the discovery of frequent pairs by PCY differs significantly from A-Priori, the

later stages, where we find frequent triples and larger sets if desired, are essentially the

same as A-Priori. This statement holds as well for each of the improvements to A-Priori

that we cover in this section. As a result, we shall cover only the construction of the fre-

quent pairs from here on.

6.3.2

The Multistage Algorithm

The Multistage Algorithm improves upon PCY by using several successive hash tables to

reduce further the number of candidate pairs. The tradeoff is that Multistage takes more

than two passes to find the frequent pairs. An outline of the Multistage Algorithm is shown

in Fig. 6.6 .

Search WWH ::

Custom Search

Home