Database Reference
In-Depth Information
6.2.4
Tyranny of Counting Pairs
As you may have noticed, we have focused on the matter of counting pairs in the discussion
so far. There is a good reason to do so: in practice the most main memory is required for
determining the frequent pairs. The number of items, while possibly very large, is rarely so
large we cannot count all the singleton sets in main memory at the same time.
What about larger sets - triples, quadruples, and so on? Recall that in order for frequent-
itemset analysis to make sense, the result has to be a small number of sets, or we cannot
even read them all, let alone consider their significance. Thus, in practice the support
threshold is set high enough that it is only a rare set that is frequent. Monotonicity tells us
that if there is a frequent triple, then there are three frequent pairs contained within it. And
of course there may be frequent pairs contained in no frequent triple as well. Thus, we ex-
pect to find more frequent pairs than frequent triples, more frequent triples than frequent
quadruples, and so on.
That argument would not be enough were it impossible to avoid counting all the triples,
since there are many more triples than pairs. It is the job of the A-Priori Algorithm and re-
lated algorithms to avoid counting many triples or larger sets, and they are, as we shall see,
effective in doing so. Thus, in what follows, we concentrate on algorithms for computing
frequent pairs.
6.2.5
The A-Priori Algorithm
For the moment, let us concentrate on finding the frequent pairs only. If we have enough
main memory to count all pairs, using either of the methods discussed in Section 6.2.2 (tri-
angular matrix or triples), then it is a simple matter to read the file of baskets in a single
pass. For each basket, we use a double loop to generate all the pairs. Each time we generate
a pair, we add 1 to its count. At the end, we examine all pairs to see which have counts that
are equal to or greater than the support threshold s ; these are the frequent pairs.
However, this simple approach fails if there are too many pairs of items to count them
all in main memory. The A-Priori Algorithm is designed to reduce the number of pairs that
must be counted, at the expense of performing two passes over data, rather than one pass.
The First Pass of A-Priori
In the first pass, we create two tables. The first table, if necessary, translates item names in-
to integers from 1 to n , as described in Section 6.2.2 . The other table is an array of counts;
the i th array element counts the occurrences of the item numbered i . Initially, the counts for
all the items are 0.
Search WWH ::




Custom Search