Database Reference
In-Depth Information
Figure 6.5 Organization of main memory for the first two passes of the PCY Algorithm
Think of this array as a hash table, whose buckets hold integers rather than sets of keys
(as in an ordinary hash table) or bits (as in a Bloom filter). Pairs of items are hashed to
buckets of this hash table. As we examine a basket during the first pass, we not only add 1
to the count for each item in the basket, but we generate all the pairs, using a double loop.
We hash each pair, and we add 1 to the bucket into which that pair hashes. Note that the
pair itself doesn't go into the bucket; the pair only affects the single integer in the bucket.
At the end of the first pass, each bucket has a count, which is the sum of the counts of all
the pairs that hash to that bucket. If the count of a bucket is at least as great as the support
threshold s , it is called a frequent bucket . We can say nothing about the pairs that hash to a
frequent bucket; they could all be frequent pairs from the information available to us. But if
the count of the bucket is less than s (an infrequent bucket ), we know no pair that hashes to
this bucket can be frequent, even if the pair consists of two frequent items. That fact gives
us an advantage on the second pass. We can define the set of candidate pairs C 2 to be those
pairs { i, j } such that:
(1) i and j are frequent items.
(2) { i, j } hashes to a frequent bucket.
It is the second condition that distinguishes PCY from A-Priori.
EXAMPLE 6.9 Depending on the data and the amount of available main memory, there may
or may not be a benefit to using the hash table on pass 1. In the worst case, all buckets
are frequent, and the PCY Algorithm counts exactly the same pairs as A-Priori does on the
second pass. However, sometimes, we can expect most of the buckets to be infrequent. In
that case, PCY reduces the memory requirements of the second pass.
Suppose we have a gigabyte of main memory available for the hash table on the first
pass. Suppose also that the data file has a billion baskets, each with ten items. A bucket is
an integer, typically 4 bytes, so we can maintain a quarter of a billion buckets. The number
of pairs in all the baskets is pairs; this number is also the sum of the counts in the
buckets. Thus, the average count is 4 . 5 × 10 10 /2 . 5 × 10 8 , or 180. If the support threshold s
is around 180 or less, we might expect few buckets to be infrequent. However, if s is much
Search WWH ::




Custom Search