to eliminate those itemsets that were identified as frequent in the sample but are not frequent in the whole, we have no false positives and hopefully none or very few false negatives.
6.4.3 The Algorithm of Savasere, Omiecinski, and Navathe
Our next improvement avoids both false negatives and false positives, at the cost of making two full passes. It is called the SON Algorithm after the authors. The idea is to divide the input file into chunks (which may be "chunks" in the sense of a distributed file system, or simply pieces of the file). Treat each chunk as a sample, and run the algorithm of Section 6.4.1 on that chunk. If each chunk is fraction p of the whole file and s is the support threshold, then we use ps as the threshold for each chunk. Store on disk all the frequent itemsets found for each chunk.
Once all the chunks have been processed in that way, take the union of all the itemsets
that have been found frequent for one or more chunks. These are the candidate itemsets.
Notice that if an itemset is not frequent in any chunk, then its support is less than ps in each chunk. Since the number of chunks is 1/p, we conclude that the total support for that itemset is less than (1/p)ps = s. Thus, every itemset that is frequent in the whole is frequent
in at least one chunk, and we can be sure that all the truly frequent itemsets are among the
candidates; i.e., there are no false negatives.
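For a concrete (hypothetical) instance, suppose s = 10,000 and the file is divided into ten chunks, so p = 1/10 and the per-chunk threshold is ps = 1,000. An itemset that is frequent in no chunk appears fewer than 1,000 times in each of the ten chunks, hence fewer than 10,000 times in the whole file, so it cannot be frequent in the whole.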
We have made a total of one pass through the data as we read each chunk and processed
it. In a second pass, we count all the candidate itemsets and select those that have support
at least s as the frequent itemsets.
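To make the two passes concrete, here is a minimal in-memory sketch in Python. It is not the book's code: the helper frequent_itemsets exhaustively enumerates every itemset occurring in a chunk instead of running the A-Priori-based algorithm of Section 6.4.1, which is workable only when baskets are small, and the names son and frequent_itemsets are ours.

from collections import Counter
from itertools import combinations

def frequent_itemsets(baskets, threshold):
    # Stand-in for Section 6.4.1: exhaustively count every itemset
    # that appears in some basket.  Adequate only for small baskets;
    # a realistic version would use A-Priori.
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        for k in range(1, len(items) + 1):
            counts.update(combinations(items, k))
    return {itemset for itemset, c in counts.items() if c >= threshold}

def son(baskets, s, num_chunks):
    chunk_size = max(1, -(-len(baskets) // num_chunks))  # ceiling division
    p = 1.0 / num_chunks  # each chunk is fraction p of the file

    # Pass 1: union of the itemsets frequent in some chunk, each chunk
    # mined at the lowered threshold ps.  These are the candidates.
    candidates = set()
    for i in range(0, len(baskets), chunk_size):
        candidates |= frequent_itemsets(baskets[i:i + chunk_size], p * s)

    # Pass 2: count each candidate over the whole file and keep those
    # with support at least s; no false negatives are possible.
    counts = Counter()
    for basket in baskets:
        items = set(basket)
        counts.update(c for c in candidates if items >= set(c))
    return {c for c in candidates if counts[c] >= s}

For example, son(baskets, s=100, num_chunks=10) mines each chunk at threshold ps = 10 and then, on the second pass, verifies the surviving candidates against the true threshold of 100.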
6.4.4 The SON Algorithm and MapReduce
The SON algorithm lends itself well to a parallel-computing environment. Each of the chunks can be processed in parallel, and the frequent itemsets from each chunk combined to form the candidates. We can distribute the candidates to many processors, have each processor count the support for each candidate in a subset of the baskets, and finally sum those supports to get the support for each candidate itemset in the whole dataset. This process does not have to be implemented in MapReduce, but there is a natural way of expressing each of the two passes as a MapReduce operation. We shall summarize this MapReduce-MapReduce sequence below.
First Map Function: Take the assigned subset of the baskets and find the itemsets frequent in the subset using the algorithm of Section 6.4.1. As described there, lower the support threshold from s to ps if each Map task gets fraction p of the total input file. The output is a set of key-value pairs (F, 1), where F is a frequent itemset from the sample.
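As a sketch of this step (again in plain Python rather than a real MapReduce framework, and reusing the illustrative frequent_itemsets helper from above), the first Map function might look like:

def first_map(chunk, p, s):
    # Emit (F, 1) for every itemset F frequent in this chunk at the
    # lowered threshold ps; the value 1 is merely a placeholder.
    for F in frequent_itemsets(chunk, p * s):
        yield (F, 1)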