Frequent Itemsets - Mining of Massive Datasets

Databases Reference

In-Depth Information

more frequent pairs than frequent triples, more frequent triples than frequent

quadruples, and so on.

That argument would not be enough were it impossible to avoid counting

all the triples, since there are many more triples than pairs. It is the job of the

A-Priori Algorithm and related algorithms to avoid counting many triples or

larger sets, and they are, as we shall see, effective in doing so. Thus, in what

follows, we concentrate on algorithms for computing frequent pairs.

6.2.5

The A-Priori Algorithm

For the moment, let us concentrate on finding the frequent pairs only. If we have

enough main memory to count all pairs, using either of the methods discussed

in Section 6.2.2 (triangular matrix or triples), then it is a simple matter to read

the file of baskets in a single pass. For each basket, we use a double loop to

generate all the pairs. Each time we generate a pair, we add 1 to its count.

At the end, we examine all pairs and see which have counts that exceed the

support threshold s; these are the frequent pairs.

However, this simple approach fails if there are too many pairs of items to

count them all in main memory. The A-Priori Algorithm is designed to reduce

the number of pairs that must be counted, at the expense of performing two

passes over data, rather than one pass.

The First Pass of A-Priori

In the first pass, we create two tables. The first table, if necessary, translates

item names into integers from 1 to n, as described in Section 6.2.2. The other

table is an array of counts; the ith array element counts the occurrences of the

item numbered i. Initially, the counts for all the items are 0.

As we read baskets, we look at each item in the basket and translate its

name into an integer. Next, we use that integer to index into the array of

counts, and we add 1 to the integer found there.

Between the Passes of A-Priori

After the first pass, we examine the counts of the items to determine which of

them are frequent as singletons. It might appear surprising that many singletons

are not frequent. But remember that we set the threshold s su ciently high

that we do not get too many frequent sets; a typical s would be 1% of the

baskets. If we think about our own visits to a supermarket, we surely buy

certain things more than 1% of the time: perhaps milk, bread, Coke or Pepsi,

and so on. We can even believe that 1% of the customers buy diapers, even

though we may not do so. However, many of the items on the shelves are surely

not bought by 1% of the customers: Creamy Caesar Salad Dressing for example.

For the second pass of A-Priori, we create a new numbering from 1 to m for

just the frequent items. This table is an array indexed 1 to n, and the entry

Search WWH ::

Custom Search

Home