sample of the data rather than the entire dataset. An algorithm called SON uses two passes, gets the exact answer, and lends itself to implementation by MapReduce or another parallel computing regime. Finally, Toivonen's Algorithm uses two passes on average, gets an exact answer, but may, rarely, not terminate in any given amount of time.
Why Not Just Pick the First Part of the File?
The risk in selecting a sample from one portion of a large file is that the data is not distributed uniformly in the file. For example, suppose the file were a list of true market-basket contents at a department store, organized by date of sale. If you took only the first baskets in the file, you would have old data; for instance, there would be no iPods in these baskets, even though iPods might have become a popular item later.
As another example, consider a file of medical tests performed at different hospitals. If each chunk comes from a different hospital, then picking chunks at random will give us a sample drawn from only a small subset of the hospitals. If hospitals perform different tests or perform them in different ways, the data may be highly biased.
6.4.1 The Simple, Randomized Algorithm
Instead of using the entire file of baskets, we could pick a random subset of the baskets and pretend it is the entire dataset. We must adjust the support threshold to reflect the smaller number of baskets. For instance, if the support threshold for the full dataset is s, and we choose a sample of 1% of the baskets, then we should examine the sample for itemsets that appear in at least s/100 of the baskets.
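As a concrete sketch of this adjustment (the function and argument names below are illustrative, not from the text), the threshold simply scales in proportion to the sample:

import math

def scaled_support(full_support, sample_fraction):
    """Scale the support threshold in proportion to the sample size.

    E.g., full_support = 10000 and sample_fraction = 0.01 give 100:
    an itemset must appear in at least 100 of the sampled baskets.
    """
    return math.ceil(full_support * sample_fraction)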
The safest way to pick the sample is to read the entire dataset, and for each basket, select
that basket for the sample with some fixed probability p . Suppose there are m baskets in
the entire file. At the end, we shall have a sample whose size is very close to pm baskets.
However, if we have reason to believe that the baskets appear in random order in the file
already, then we do not even have to read the entire file. We can select the first pm baskets
for our sample. Or, if the file is part of a distributed file system, we can pick some chunks
at random to serve as the sample.
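A minimal sketch of the safest method, reading every basket and keeping each with fixed probability p (the function name and the use of Python's random module are assumptions for illustration, not part of the text):

import random

def sample_baskets(basket_stream, p, seed=None):
    """Read every basket once, keeping each with probability p.

    If the stream holds m baskets, the sample size is close to p*m
    in expectation. basket_stream yields one basket (an iterable of
    items) at a time, e.g. parsed lines of the file.
    """
    rng = random.Random(seed)
    return [basket for basket in basket_stream if rng.random() < p]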
Having selected our sample of the baskets, we use part of main memory to store these baskets. The balance of the main memory is used to execute one of the algorithms we have discussed, such as A-Priori, PCY, Multistage, or Multihash. However, the algorithm must run passes over the main-memory sample for each itemset size, until we find a size with no frequent itemsets. There are no disk accesses needed to read the sample, since it resides in main memory. As frequent itemsets of each size are discovered, they can be written out to disk; this operation and the initial reading of the sample from disk are the only disk I/O's the algorithm does.
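To make the pass structure concrete, here is a minimal in-memory A-Priori sketch run over the sample. It is a naive illustration, assuming hashable and comparable items, rather than the optimized PCY, Multistage, or Multihash variants:

from collections import Counter
from itertools import combinations

def apriori_on_sample(sample, threshold):
    """Run A-Priori entirely in main memory over the sampled baskets,
    growing itemset size k until some size has no frequent itemsets."""
    # Pass for k = 1: count individual items.
    counts = Counter(item for basket in sample for item in set(basket))
    current = {frozenset([i]) for i, c in counts.items() if c >= threshold}
    frequent, k = {}, 1
    while current:
        frequent[k] = current          # frequent itemsets of size k
        k += 1
        frequent_items = {i for s in current for i in s}
        counts = Counter()
        for basket in sample:          # one pass per itemset size
            items = sorted(set(basket) & frequent_items)
            for combo in combinations(items, k):
                # Count a candidate only if all its (k-1)-subsets
                # are frequent (A-Priori's monotonicity pruning).
                if all(frozenset(sub) in current
                       for sub in combinations(combo, k - 1)):
                    counts[frozenset(combo)] += 1
        current = {s for s, c in counts.items() if c >= threshold}
    return frequent

Each entry frequent[k] could be written to disk as it is completed, matching the I/O pattern described above: read the sample once, write out the frequent itemsets, and touch the disk for nothing else.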
Of course the algorithm will fail if whichever method from Section 6.2 or 6.3 we choose
cannot be run in the amount of main memory left after storing the sample. If we need