sample of the data rather than the entire dataset. An algorithm called SON uses two passes, gets the exact answer, and lends itself to implementation by MapReduce or another parallel computing regime. Finally, Toivonen's Algorithm uses two passes on average, gets an exact answer, but may, rarely, not terminate in any given amount of time.
Why Not Just Pick the First Part of the File?
The risk in selecting a sample from one portion of a large file is that the data is not distributed uniformly in the file. For example, suppose the file were a list of true market-basket contents at a department store, organized by date of sale. If you took only the first baskets in the file, you would have old data; for instance, there would be no iPods in these baskets, even though iPods might have become a popular item later.
As another example, consider a file of medical tests performed at different hospitals. If each chunk comes from a different hospital, then picking chunks at random will give us a sample drawn from only a small subset of the hospitals. If hospitals perform different tests or perform them in different ways, the data may be highly biased.
6.4.1 The Simple, Randomized Algorithm
Instead of using the entire file of baskets, we could pick a random subset of the baskets and pretend it is the entire dataset. We must adjust the support threshold to reflect the smaller number of baskets. For instance, if the support threshold for the full dataset is s, and we choose a sample of 1% of the baskets, then we should examine the sample for itemsets that appear in at least s/100 of the baskets.
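As a concrete sketch of this adjustment (the function and argument names below are illustrative, not from the text), the threshold simply scales in proportion to the sample:

import math

def scaled_support(full_support, sample_fraction):
    """Scale the support threshold in proportion to the sample size.

    E.g., full_support = 10000 and sample_fraction = 0.01 give 100:
    an itemset must appear in at least 100 of the sampled baskets.
    """
    return math.ceil(full_support * sample_fraction)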
The safest way to pick the sample is to read the entire dataset, and for each basket, select
that basket for the sample with some fixed probability p . Suppose there are m baskets in
the entire file. At the end, we shall have a sample whose size is very close to pm baskets.
However, if we have reason to believe that the baskets appear in random order in the file
already, then we do not even have to read the entire file. We can select the first pm baskets
for our sample. Or, if the file is part of a distributed file system, we can pick some chunks
at random to serve as the sample.
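A minimal sketch of the safest method, reading every basket and keeping each with fixed probability p (the function name and the use of Python's random module are assumptions for illustration, not part of the text):

import random

def sample_baskets(basket_stream, p, seed=None):
    """Read every basket once, keeping each with probability p.

    If the stream holds m baskets, the sample size is close to p*m
    in expectation. basket_stream yields one basket (an iterable of
    items) at a time, e.g. parsed lines of the file.
    """
    rng = random.Random(seed)
    return [basket for basket in basket_stream if rng.random() < p]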
Having selected our sample of the baskets, we use part of main memory to store these baskets. The balance of the main memory is used to execute one of the algorithms we have discussed, such as A-Priori, PCY, Multistage, or Multihash. However, the algorithm must run passes over the main-memory sample for each itemset size, until we find a size with no frequent itemsets. There are no disk accesses needed to read the sample, since it resides in main memory. As frequent itemsets of each size are discovered, they can be written out to disk; this operation and the initial reading of the sample from disk are the only disk I/O's the algorithm does.
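To make the pass structure concrete, here is a minimal in-memory A-Priori sketch run over the sample. It is a naive illustration, assuming hashable and comparable items, rather than the optimized PCY, Multistage, or Multihash variants:

from collections import Counter
from itertools import combinations

def apriori_on_sample(sample, threshold):
    """Run A-Priori entirely in main memory over the sampled baskets,
    growing itemset size k until some size has no frequent itemsets."""
    # Pass for k = 1: count individual items.
    counts = Counter(item for basket in sample for item in set(basket))
    current = {frozenset([i]) for i, c in counts.items() if c >= threshold}
    frequent, k = {}, 1
    while current:
        frequent[k] = current          # frequent itemsets of size k
        k += 1
        frequent_items = {i for s in current for i in s}
        counts = Counter()
        for basket in sample:          # one pass per itemset size
            items = sorted(set(basket) & frequent_items)
            for combo in combinations(items, k):
                # Count a candidate only if all its (k-1)-subsets
                # are frequent (A-Priori's monotonicity pruning).
                if all(frozenset(sub) in current
                       for sub in combinations(combo, k - 1)):
                    counts[frozenset(combo)] += 1
        current = {s for s, c in counts.items() if c >= threshold}
    return frequent

Each entry frequent[k] could be written to disk as it is completed, matching the I/O pattern described above: read the sample once, write out the frequent itemsets, and touch the disk for nothing else.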
Of course the algorithm will fail if whichever method from Section 6.2 or 6.3 we choose
cannot be run in the amount of main memory left after storing the sample. If we need