Database Reference
In-Depth Information
time of a frequent-itemset algorithm by the number of times that each disk block of the data
file is read.
Moreover, all the algorithms we discuss have the property that they read the basket file
sequentially. Thus, algorithms can be characterized by the number of passes through the
basket file that they make, and their running time is proportional to the product of the num-
ber of passes they make through the basket file times the size of that file. Since we cannot
control the amount of data, only the number of passes taken by the algorithm matters, and
it is that aspect of the algorithm that we shall focus upon when measuring the running time
of a frequent-itemset algorithm.
6.2.2
Use of Main Memory for Itemset Counting
There is a second data-related issue that we must examine, however. All frequent-itemset
algorithms require us to maintain many different counts as we make a pass through the
data. For example, we might need to count the number of times that each pair of items oc-
curs in baskets. If we do not have enough main memory to store each of the counts, then
adding 1 to a random count will most likely require us to load a page from disk. In that
case, the algorithm will thrash and run many orders of magnitude slower than if we were
certain to find each count in main memory. The conclusion is that we cannot count any-
thing that doesn't fit in main memory. Thus, each algorithm has a limit on how many items
it can deal with.
EXAMPLE 6.5 Suppose a certain algorithm has to count all pairs of items, and there are n
items. We thus need space to store integers, or about n 2 /2 integers. If integers take 4
bytes, we require 2 n 2 bytes. If our machine has 2 gigabytes, or 2 31 bytes of main memory,
then we require n ≤ 2 15 , or approximately n < 33,000.
It is not trivial to store the counts in a way that makes it easy to find the count for
a pair { i, j }. First, we have not assumed anything about how items are represented. They
might, for instance, be strings like “bread.” It is more space-efficient to represent items by
consecutive integers from 1 to n , where n is the number of distinct items. Unless items are
already represented this way, we need a hash table that translates items as they appear in
the file to integers. That is, each time we see an item in the file, we hash it. If it is already
in the hash table, we can obtain its integer code from its entry in the table. If the item is not
there, we assign it the next available number (from a count of the number of distinct items
seen so far) and enter the item and its code into the table.
Search WWH ::




Custom Search