6.4 Limited-Pass Algorithms
The algorithms for frequent itemsets discussed so far use one pass for each size
of itemset we investigate. If main memory is too small to hold the data and the
space needed to count frequent itemsets of one size, there does not seem to be
any way to avoid k passes to compute the exact collection of frequent itemsets
of size up to k.
However, there are many applications where it is not essential to discover every
frequent itemset. For instance, if we are looking for items purchased together at
a supermarket, we are not going to run a sale based on every frequent itemset
we find, so it is quite sufficient to find most but not all of the frequent itemsets.
In this section we explore some algorithms that have been proposed to find
all or most frequent itemsets using at most two passes. We begin with the
obvious approach of using a sample of the data rather than the entire dataset.
An algorithm called SON uses two passes, gets the exact answer, and lends
itself to implementation by map-reduce or another parallel computing regime.
Finally, Toivonen's Algorithm uses two passes on average, gets an exact answer,
but may, rarely, not terminate in any given amount of time.
6.4.1 The Simple, Randomized Algorithm
Instead of using the entire file of baskets, we could pick a random subset of
the baskets and pretend it is the entire dataset. We must adjust the support
threshold to reflect the smaller number of baskets. For instance, if the support
threshold for the full dataset is s, and we choose a sample of 1% of the baskets,
then we should examine the sample for itemsets that appear in at least s/100
of the baskets.
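To make the arithmetic concrete, here is a tiny sketch of the threshold adjustment; the specific numbers are invented for illustration:

```python
s = 10_000        # hypothetical support threshold on the full dataset
fraction = 0.01   # we sample 1% of the baskets
# On the sample, an itemset should appear in at least s * fraction baskets,
# i.e., s/100 for a 1% sample.
sample_threshold = s * fraction   # = 100.0
```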
The safest way to pick the sample is to read the entire dataset, and for each
basket, select that basket for the sample with some fixed probability p. Suppose
there are m baskets in the entire file. At the end, we shall have a sample whose
size is very close to pm baskets. However, if we have reason to believe that the
baskets appear in random order in the file already, then we do not even have
to read the entire file. We can select the first pm baskets for our sample. Or, if
the file is part of a distributed file system, we can pick some chunks at random
to serve as the sample.
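A minimal sketch of the "safest" method, Bernoulli sampling, appears below. The function name and the file format (one basket of item IDs per line) are assumptions made for illustration, not part of the text:

```python
import random

def sample_baskets(basket_file, p, seed=None):
    """Select each basket independently with probability p,
    yielding a sample of about p*m of the m baskets."""
    rng = random.Random(seed)
    sample = []
    with open(basket_file) as f:
        for line in f:
            if rng.random() < p:
                sample.append(line.split())  # basket = list of item IDs
    return sample
```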
Having selected our sample of the baskets, we use part of main memory
to store these baskets. The balance of the main memory is used to execute
one of the algorithms we have discussed, such as A-Priori, PCY, Multistage,
or Multihash. However, the algorithm must run passes over the main-memory
sample for each itemset size, until we find a size with no frequent itemsets. There
are no disk accesses needed to read the sample, since it resides in main memory.
As frequent itemsets of each size are discovered, they can be written out to disk;
this operation and the initial reading of the sample from disk are the only disk
I/O's the algorithm does.
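The following is a minimal sketch of this in-memory loop, using plain A-Priori as the counting method; any of the other algorithms mentioned above could replace the counting step. The function name and the basket representation (lists of item IDs held in memory) are assumptions for illustration:

```python
from itertools import combinations
from collections import Counter

def apriori_in_memory(sample, sample_threshold):
    """One pass over the in-memory sample per itemset size,
    stopping at the first size with no frequent itemsets."""
    frequent = {}
    k = 1
    while True:
        counts = Counter()
        if k == 1:
            for basket in sample:
                counts.update(frozenset([item]) for item in set(basket))
        else:
            prev = frequent[k - 1]
            for basket in sample:
                # Keep only frequent singletons, then form candidate k-sets.
                items = [i for i in set(basket) if frozenset([i]) in frequent[1]]
                for combo in combinations(sorted(items), k):
                    cand = frozenset(combo)
                    # A-Priori candidate check: all (k-1)-subsets frequent.
                    if all(cand - {i} in prev for i in cand):
                        counts[cand] += 1
        level = {iset for iset, c in counts.items() if c >= sample_threshold}
        if not level:
            return frequent
        frequent[k] = level
        k += 1
```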
Of course the algorithm will fail if whichever method from Section 6.2 or 6.3
we choose cannot be run in the amount of main memory left after storing the
sample.