Databases Reference
In-Depth Information
Moreover, all the algorithms we discuss have the property that they read the
basket file sequentially. Thus, algorithms can be characterized by the number
of passes through the basket file that they make, and their running time is
proportional to the product of the number of passes they make through the
basket file times the size of that file. Since we cannot control the amount of
data, only the number of passes taken by the algorithm matters, and it is that
aspect of the algorithm that we shall focus upon when measuring the running
time of a frequent-itemset algorithm.
6.2.2 Use of Main Memory for Itemset Counting
There is a second data-related issue that we must examine, however. All
frequent-itemset algorithms require us to maintain many different counts as
we make a pass through the data. For example, we might need to count the
number of times that each pair of items occurs in baskets. If we do not have
enough main memory to store each of the counts, then adding 1 to a random
count will most likely require us to load a page from disk. In that case, the
algorithm will thrash and run many orders of magnitude slower than if we were
certain to find each count in main memory. The conclusion is that we cannot
count anything that doesn't fit in main memory. Thus, each algorithm has a
limit on how many items it can deal with.
Example 6.5 : Suppose a certain algorithm has to count all pairs of items,
and there are n items.
n
2
integers, or about
n 2 /2 integers. If integers take 4 bytes, we require 2n 2 bytes. If our machine
has 2 gigabytes, or 2 31 bytes of main memory, then we require n≤2 15 , or
approximately n < 33,000.
We thus need space to store
2
n
2
counts in a way that makes it easy to find
the count for a pair{i, j}. First, we have not assumed anything about how
items are represented. They might, for instance, be strings like “bread.” It
is more space-e cient to represent items by consecutive integers from 1 to n,
where n is the number of distinct items. Unless items are already represented
this way, we need a hash table that translates items as they appear in the file
to integers. That is, each time we see an item in the file, we hash it. If it is
already in the hash table, we can obtain its integer code from its entry in the
table. If the item is not there, we assign it the next available number (from a
count of the number of distinct items seen so far) and enter the item and its
code into the table.
It is not trivial to store the
The Triangular-Matrix Method
Even after coding items as integers, we still have the problem that we must
count a pair{i, j}in only one place. For example, we could order the pair so
that i < j, and only use the entry a[i, j] in a two-dimensional array a. That
strategy would make half the array useless.
A more space-e cient way is to
Search WWH ::




Custom Search