Database Reference
In-Depth Information
a function of S , what is the largest value of P for which we can successfully run the PCY
Algorithm on this data?
! EXERCISE 6.3.5 Under the assumptions given in Exercise 6.3.4 , will the Multihash Al-
gorithm reduce the main-memory requirements for the second pass? As a function of S and
P , what is the optimum number of hash tables to use on the first pass?
! EXERCISE 6.3.6 Suppose we perform the 3-pass Multistage Algorithm to find frequent
pairs, with market-basket data meeting the following specifications:
(1) The support threshold is 10,000.
(2) There are one million items, represented by the integers 0 , 1 , . . . , 999999. All items
are frequent; that is, they occur at least 10,000 times.
(3) There are one million pairs that occur 10,000 times or more.
(4) There are P pairs that occur exactly once.
(5) No other pairs occur at all.
(6) Integers are always represented by 4 bytes.
(7) When we hash pairs, they distribute among buckets randomly, but as evenly as pos-
sible; i.e., you may assume that each bucket gets exactly its fair share of the P pairs
that occur once.
(8) The hash functions used on the first two passes are completely independent.
Suppose there are S bytes of main memory. As a function of S and P , what is the expected
number of candidate pairs on the third pass of the Multistage Algorithm?
6.4 Limited-Pass Algorithms
The algorithms for frequent itemsets discussed so far use one pass for each size of itemset
we investigate. If main memory is too small to hold the data and the space needed to count
frequent itemsets of one size, there does not seem to be any way to avoid k passes to com-
pute the exact collection of frequent itemsets. However, there are many applications where
it is not essential to discover every frequent itemset. For instance, if we are looking for
items purchased together at a supermarket, we are not going to run a sale based on every
frequent itemset we find, so it is quite sufficient to find most but not all of the frequent
itemsets.
In this section we explore some algorithms that have been proposed to find all or most
frequent itemsets using at most two passes. We begin with the obvious approach of using a
Search WWH ::




Custom Search