Databases Reference
In-Depth Information
more frequent pairs than frequent triples, more frequent triples than frequent
quadruples, and so on.
That argument would not be enough were it impossible to avoid counting
all the triples, since there are many more triples than pairs. It is the job of the
A-Priori Algorithm and related algorithms to avoid counting many triples or
larger sets, and they are, as we shall see, effective in doing so. Thus, in what
follows, we concentrate on algorithms for computing frequent pairs.
6.2.5
The A-Priori Algorithm
For the moment, let us concentrate on finding the frequent pairs only. If we have
enough main memory to count all pairs, using either of the methods discussed
in Section 6.2.2 (triangular matrix or triples), then it is a simple matter to read
the file of baskets in a single pass. For each basket, we use a double loop to
generate all the pairs. Each time we generate a pair, we add 1 to its count.
At the end, we examine all pairs and see which have counts that exceed the
support threshold s; these are the frequent pairs.
However, this simple approach fails if there are too many pairs of items to
count them all in main memory. The A-Priori Algorithm is designed to reduce
the number of pairs that must be counted, at the expense of performing two
passes over data, rather than one pass.
The First Pass of A-Priori
In the first pass, we create two tables. The first table, if necessary, translates
item names into integers from 1 to n, as described in Section 6.2.2. The other
table is an array of counts; the ith array element counts the occurrences of the
item numbered i. Initially, the counts for all the items are 0.
As we read baskets, we look at each item in the basket and translate its
name into an integer. Next, we use that integer to index into the array of
counts, and we add 1 to the integer found there.
Between the Passes of A-Priori
After the first pass, we examine the counts of the items to determine which of
them are frequent as singletons. It might appear surprising that many singletons
are not frequent. But remember that we set the threshold s su ciently high
that we do not get too many frequent sets; a typical s would be 1% of the
baskets. If we think about our own visits to a supermarket, we surely buy
certain things more than 1% of the time: perhaps milk, bread, Coke or Pepsi,
and so on. We can even believe that 1% of the customers buy diapers, even
though we may not do so. However, many of the items on the shelves are surely
not bought by 1% of the customers: Creamy Caesar Salad Dressing for example.
For the second pass of A-Priori, we create a new numbering from 1 to m for
just the frequent items. This table is an array indexed 1 to n, and the entry
Search WWH ::




Custom Search