6.4 Limited-Pass Algorithms
The algorithms for frequent itemsets discussed so far use one pass for each size
of itemset we investigate. If main memory is too small to hold the data and the
space needed to count frequent itemsets of one size, there does not seem to be
any way to avoid k passes to compute the exact collection of frequent itemsets
of size up to k.
However, there are many applications where it is not essential to discover every
frequent itemset. For instance, if we are looking for items purchased together at
a supermarket, we are not going to run a sale based on every frequent itemset
we find, so it is quite sufficient to find most but not all of the frequent itemsets.
In this section we explore some algorithms that have been proposed to find
all or most frequent itemsets using at most two passes. We begin with the
obvious approach of using a sample of the data rather than the entire dataset.
An algorithm called SON uses two passes, gets the exact answer, and lends
itself to implementation by map-reduce or another parallel computing regime.
Finally, Toivonen's Algorithm uses two passes on average, gets an exact answer,
but may, rarely, not terminate in any given amount of time.
6.4.1 The Simple, Randomized Algorithm
Instead of using the entire file of baskets, we could pick a random subset of
the baskets and pretend it is the entire dataset. We must adjust the support
threshold to reflect the smaller number of baskets. For instance, if the support
threshold for the full dataset is s, and we choose a sample of 1% of the baskets,
then we should examine the sample for itemsets that appear in at least s/100
of the baskets.
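To make the arithmetic concrete, here is a tiny sketch of the threshold adjustment; the specific numbers are invented for illustration:

```python
s = 10_000        # hypothetical support threshold on the full dataset
fraction = 0.01   # we sample 1% of the baskets
# On the sample, an itemset should appear in at least s * fraction baskets,
# i.e., s/100 for a 1% sample.
sample_threshold = s * fraction   # = 100.0
```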
The safest way to pick the sample is to read the entire dataset, and for each
basket, select that basket for the sample with some fixed probability p. Suppose
there are m baskets in the entire file. At the end, we shall have a sample whose
size is very close to pm baskets. However, if we have reason to believe that the
baskets appear in random order in the file already, then we do not even have
to read the entire file. We can select the first pm baskets for our sample. Or, if
the file is part of a distributed file system, we can pick some chunks at random
to serve as the sample.
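A minimal sketch of the "safest" method, Bernoulli sampling, appears below. The function name and the file format (one basket of item IDs per line) are assumptions made for illustration, not part of the text:

```python
import random

def sample_baskets(basket_file, p, seed=None):
    """Select each basket independently with probability p,
    yielding a sample of about p*m of the m baskets."""
    rng = random.Random(seed)
    sample = []
    with open(basket_file) as f:
        for line in f:
            if rng.random() < p:
                sample.append(line.split())  # basket = list of item IDs
    return sample
```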
Having selected our sample of the baskets, we use part of main memory
to store these baskets. The balance of the main memory is used to execute
one of the algorithms we have discussed, such as A-Priori, PCY, Multistage,
or Multihash. However, the algorithm must run passes over the main-memory
sample for each itemset size, until we find a size with no frequent itemsets. There
are no disk accesses needed to read the sample, since it resides in main memory.
As frequent itemsets of each size are discovered, they can be written out to disk;
this operation and the initial reading of the sample from disk are the only disk
I/O's the algorithm does.
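The following is a minimal sketch of this in-memory loop, using plain A-Priori as the counting method; any of the other algorithms mentioned above could replace the counting step. The function name and the basket representation (lists of item IDs held in memory) are assumptions for illustration:

```python
from itertools import combinations
from collections import Counter

def apriori_in_memory(sample, sample_threshold):
    """One pass over the in-memory sample per itemset size,
    stopping at the first size with no frequent itemsets."""
    frequent = {}
    k = 1
    while True:
        counts = Counter()
        if k == 1:
            for basket in sample:
                counts.update(frozenset([item]) for item in set(basket))
        else:
            prev = frequent[k - 1]
            for basket in sample:
                # Keep only frequent singletons, then form candidate k-sets.
                items = [i for i in set(basket) if frozenset([i]) in frequent[1]]
                for combo in combinations(sorted(items), k):
                    cand = frozenset(combo)
                    # A-Priori candidate check: all (k-1)-subsets frequent.
                    if all(cand - {i} in prev for i in cand):
                        counts[cand] += 1
        level = {iset for iset, c in counts.items() if c >= sample_threshold}
        if not level:
            return frequent
        frequent[k] = level
        k += 1
```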
Of course the algorithm will fail if whichever method from Section 6.2 or 6.3
we choose cannot be run in the amount of main memory left after storing the
sample.