Frequent Itemsets - Mining of Massive Datasets

Database Reference

In-Depth Information

6.2.4

Tyranny of Counting Pairs

As you may have noticed, we have focused on the matter of counting pairs in the discussion

so far. There is a good reason to do so: in practice the most main memory is required for

determining the frequent pairs. The number of items, while possibly very large, is rarely so

large we cannot count all the singleton sets in main memory at the same time.

What about larger sets - triples, quadruples, and so on? Recall that in order for frequent-

itemset analysis to make sense, the result has to be a small number of sets, or we cannot

even read them all, let alone consider their significance. Thus, in practice the support

threshold is set high enough that it is only a rare set that is frequent. Monotonicity tells us

that if there is a frequent triple, then there are three frequent pairs contained within it. And

of course there may be frequent pairs contained in no frequent triple as well. Thus, we ex-

pect to find more frequent pairs than frequent triples, more frequent triples than frequent

quadruples, and so on.

That argument would not be enough were it impossible to avoid counting all the triples,

since there are many more triples than pairs. It is the job of the A-Priori Algorithm and re-

lated algorithms to avoid counting many triples or larger sets, and they are, as we shall see,

effective in doing so. Thus, in what follows, we concentrate on algorithms for computing

frequent pairs.

6.2.5

The A-Priori Algorithm

For the moment, let us concentrate on finding the frequent pairs only. If we have enough

main memory to count all pairs, using either of the methods discussed in Section 6.2.2 (tri-

angular matrix or triples), then it is a simple matter to read the file of baskets in a single

pass. For each basket, we use a double loop to generate all the pairs. Each time we generate

a pair, we add 1 to its count. At the end, we examine all pairs to see which have counts that

are equal to or greater than the support threshold s ; these are the frequent pairs.

However, this simple approach fails if there are too many pairs of items to count them

all in main memory. The A-Priori Algorithm is designed to reduce the number of pairs that

must be counted, at the expense of performing two passes over data, rather than one pass.

The First Pass of A-Priori

In the first pass, we create two tables. The first table, if necessary, translates item names in-

to integers from 1 to n , as described in Section 6.2.2 . The other table is an array of counts;

the i th array element counts the occurrences of the item numbered i . Initially, the counts for

all the items are 0.

Search WWH ::

Custom Search

Home