Frequent Itemsets - Mining of Massive Datasets

Databases Reference

In-Depth Information

Moreover, all the algorithms we discuss have the property that they read the

basket file sequentially. Thus, algorithms can be characterized by the number

of passes through the basket file that they make, and their running time is

proportional to the product of the number of passes they make through the

basket file times the size of that file. Since we cannot control the amount of

data, only the number of passes taken by the algorithm matters, and it is that

aspect of the algorithm that we shall focus upon when measuring the running

time of a frequent-itemset algorithm.

6.2.2 Use of Main Memory for Itemset Counting

There is a second data-related issue that we must examine, however. All

frequent-itemset algorithms require us to maintain many different counts as

we make a pass through the data. For example, we might need to count the

number of times that each pair of items occurs in baskets. If we do not have

enough main memory to store each of the counts, then adding 1 to a random

count will most likely require us to load a page from disk. In that case, the

algorithm will thrash and run many orders of magnitude slower than if we were

certain to find each count in main memory. The conclusion is that we cannot

count anything that doesn't fit in main memory. Thus, each algorithm has a

limit on how many items it can deal with.

Example 6.5 : Suppose a certain algorithm has to count all pairs of items,

and there are n items.

n

2

integers, or about

n 2 /2 integers. If integers take 4 bytes, we require 2n 2 bytes. If our machine

has 2 gigabytes, or 2 31 bytes of main memory, then we require n≤2 15 , or

approximately n < 33,000.

We thus need space to store

2

n

2

counts in a way that makes it easy to find

the count for a pair{i, j}. First, we have not assumed anything about how

items are represented. They might, for instance, be strings like “bread.” It

is more space-e cient to represent items by consecutive integers from 1 to n,

where n is the number of distinct items. Unless items are already represented

this way, we need a hash table that translates items as they appear in the file

to integers. That is, each time we see an item in the file, we hash it. If it is

already in the hash table, we can obtain its integer code from its entry in the

table. If the item is not there, we assign it the next available number (from a

count of the number of distinct items seen so far) and enter the item and its

code into the table.

It is not trivial to store the

The Triangular-Matrix Method

Even after coding items as integers, we still have the problem that we must

count a pair{i, j}in only one place. For example, we could order the pair so

that i < j, and only use the entry a[i, j] in a two-dimensional array a. That

strategy would make half the array useless.

A more space-e cient way is to

Mining of Massive Datasets

Search WWH ::

Custom Search

Home