Database Reference
In-Depth Information
(a) By any method, compute the support for each item and each pair of items.
(b) Which pairs hash to which buckets?
(c) Which buckets are frequent?
(d) Which pairs are counted on the second pass of the PCY Algorithm?
EXERCISE 6.3.2 Suppose we run the Multistage Algorithm on the data of Exercise 6.3.1 ,
with the same support threshold of 4. The first pass is the same as in that exercise, and for
the second pass, we hash pairs to nine buckets, using the hash function that hashes { i, j }
to bucket i + j mod 9. Determine the counts of the buckets on the second pass. Does the
second pass reduce the set of candidate pairs? Note that all items are frequent, so the only
reason a pair would not be hashed on the second pass is if it hashed to an infrequent bucket
on the first pass.
EXERCISE 6.3.3 Suppose we run the Multihash Algorithm on the data of Exercise 6.3.1 . We
shall use two hash tables with five buckets each. For one, the set { i, j }, is hashed to buck-
et 2 i + 3 j + 4 mod 5, and for the other, the set is hashed to i +4 j mod 5. Since these hash
functions are not symmetric in i and j , order the items so that i < j when evaluating each
hash function. Determine the counts of each of the 10 buckets. How large does the support
threshold have to be for the Multistage Algorithm to eliminate more pairs than the PCY
Algorithm would, using the hash table and function described in Exercise 6.3.1 ?
! EXERCISE 6.3.4 Suppose we perform the PCY Algorithm to find frequent pairs, with
market-basket data meeting the following specifications:
(1) The support threshold is 10,000.
(2) There are one million items, represented by the integers 0 , 1 , . . . , 999999.
(3) There are 250,000 frequent items, that is, items that occur 10,000 times or more.
(4) There are one million pairs that occur 10,000 times or more.
(5) There are P pairs that occur exactly once and consist of two frequent items.
(6) No other pairs occur at all.
(7) Integers are always represented by 4 bytes.
(8) When we hash pairs, they distribute among buckets randomly, but as evenly as pos-
sible; i.e., you may assume that each bucket gets exactly its fair share of the P pairs
that occur once.
Suppose there are S bytes of main memory. In order to run the PCY Algorithm successfully,
the number of buckets must be sufficiently large that most buckets are not frequent. In ad-
dition, on the second pass, there must be enough room to count all the candidate pairs. As
Search WWH ::




Custom Search