6.2.1 Representation of Market-Basket Data
As we mentioned, we assume that market-basket data is stored in a file basket-by-basket.
Possibly, the data is in a distributed file system as in Section 2.1, and the baskets are the
objects the file contains. Or the data may be stored in a conventional file, with a character
code to represent the baskets and their items.
EXAMPLE 6.4 We could imagine that such a file begins:
{23,456,1001}{3,18,92,145}{...
Here, the character { begins a basket and the character } ends it. The items in a basket are
represented by integers and are separated by commas. Thus, the first basket contains items
23, 456, and 1001; the second basket contains items 3, 18, 92, and 145.
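To make the representation concrete, here is a minimal sketch in Python of how a string in this character code might be decoded into baskets of integer item IDs; the helper name parse_baskets is ours, not from the text:

def parse_baskets(text):
    # Split on '}' so each chunk is one basket; strip the leading '{'
    # and convert the comma-separated items to integers.
    baskets = []
    for chunk in text.split('}'):
        chunk = chunk.strip().lstrip('{')
        if chunk:
            baskets.append([int(item) for item in chunk.split(',')])
    return baskets

print(parse_baskets('{23,456,1001}{3,18,92,145}'))
# [[23, 456, 1001], [3, 18, 92, 145]]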
It may be that one machine receives the entire file. Or we could be using MapReduce
or a similar tool to divide the work among many processors, in which case each processor
receives only a part of the file. It turns out that combining the work of parallel processors
to get the exact collection of itemsets that meet a global support threshold is hard, and we
shall address this question only in Section 6.4.4.
We also assume that the size of the file of baskets is sufficiently large that it does not fit
in main memory. Thus, a major cost of any algorithm is the time it takes to read the baskets
from disk. Once a disk block full of baskets is in main memory, we can expand it, generating all the subsets of size k. Since one of the assumptions of our model is that the average size of a basket is small, generating all the pairs in main memory should take time that is much less than the time it took to read the basket from disk. For example, if there are 20 items in a basket, then there are $\binom{20}{2} = 190$ pairs of items in the basket, and these can be generated easily in a pair of nested for-loops.
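A minimal sketch of this double loop, assuming each basket is a Python list of item IDs as in Example 6.4:

def all_pairs(basket):
    # Two nested for-loops enumerate each unordered pair exactly once.
    pairs = []
    n = len(basket)
    for i in range(n):
        for j in range(i + 1, n):
            pairs.append((basket[i], basket[j]))
    return pairs

# A 20-item basket yields 20*19/2 = 190 pairs.
assert len(all_pairs(list(range(20)))) == 190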
As the size of the subsets we want to generate gets larger, the time required grows; in fact, it takes approximately time $n^k/k!$ to generate all the subsets of size k for a basket with n items. Eventually, this time dominates the time needed to transfer the data from disk.
However:
(1) Often, we need only small frequent itemsets, so k never grows beyond 2 or 3.
(2) When we do need the itemsets for a large size k, it is usually possible to eliminate
many of the items in each basket as unable to participate in a frequent itemset, so the
value of n drops as k increases.
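For larger k, subset generation can be sketched as follows, using Python's standard itertools.combinations; the number of such subsets is $\binom{n}{k}$, which is roughly $n^k/k!$ when k is small relative to n:

from itertools import combinations

def subsets_of_size_k(basket, k):
    # Enumerates all C(n, k) size-k subsets of an n-item basket.
    return list(combinations(basket, k))

print(subsets_of_size_k([3, 18, 92, 145], 3))
# [(3, 18, 92), (3, 18, 145), (3, 92, 145), (18, 92, 145)]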
The conclusion we would like to draw is that the work of examining each of the baskets can usually be assumed proportional to the size of the file. We can thus measure the running time of a frequent-itemset algorithm by the number of times that each disk block of the data file is read.