6.2.1 Representation of Market-Basket Data
As we mentioned, we assume that market-basket data is stored in a file basket-
by-basket. Possibly, the data is in a distributed file system as in Section 2.1,
and the baskets are the objects the file contains. Or the data may be stored
in a conventional file, with a character code to represent the baskets and their
items.
Example 6.4 : We could imagine that such a file begins:

{23,456,1001}{3,18,92,145}{...

Here, the character { begins a basket and the character } ends it. The items in
a basket are represented by integers and are separated by commas. Thus, the
first basket contains items 23, 456, and 1001; the second basket contains items
3, 18, 92, and 145. □
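A minimal sketch of parsing baskets in this character-coded format (the function name and the use of Python are illustrative assumptions, not part of the text):

```python
def parse_baskets(text):
    """Split a string like "{23,456,1001}{3,18,92,145}" into lists of item IDs.

    Each basket is delimited by { and }; items are comma-separated integers.
    """
    baskets = []
    for chunk in text.split("}"):          # each chunk is one "{...}" basket
        chunk = chunk.strip().lstrip("{")  # drop the opening brace
        if chunk:
            baskets.append([int(item) for item in chunk.split(",")])
    return baskets

print(parse_baskets("{23,456,1001}{3,18,92,145}"))
# → [[23, 456, 1001], [3, 18, 92, 145]]
```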
It may be that one machine receives the entire file. Or we could be using
map-reduce or a similar tool to divide the work among many processors, in
which case each processor receives only a part of the file. It turns out that
combining the work of parallel processors to get the exact collection of itemsets
that meet a global support threshold is hard, and we shall address this question
only in Section 6.4.4.
We also assume that the size of the file of baskets is sufficiently large that it
does not fit in main memory. Thus, a major cost of any algorithm is the time
it takes to read the baskets from disk. Once a disk block full of baskets is in
main memory, we can expand it, generating all the subsets of size k. Since one
of the assumptions of our model is that the average size of a basket is small,
generating all the pairs in main memory should take time that is much less
than the time it took to read the basket from disk. For example, if there are 20
items in a basket, then there are $\binom{20}{2} = 190$ pairs of items in the basket, and
these can be generated easily in a pair of nested for-loops.
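The pair of nested for-loops can be sketched as follows (Python is my choice here; the text itself gives no code):

```python
def all_pairs(basket):
    """Generate every unordered pair of distinct items in one basket."""
    pairs = []
    for i in range(len(basket)):
        for j in range(i + 1, len(basket)):   # j > i avoids duplicates
            pairs.append((basket[i], basket[j]))
    return pairs

basket = list(range(20))       # a basket of 20 items
print(len(all_pairs(basket)))  # → 190, i.e. C(20, 2)
```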
As the size of the subsets we want to generate gets larger, the time required
grows; in fact, it takes approximately time $n^k/k!$ to generate all the subsets
of size k for a basket with n items. Eventually, this time dominates the time
needed to transfer the data from disk. However:
1. Often, we need only small frequent itemsets, so k never grows beyond 2
or 3.
2. And when we do need the itemsets for a large size k, it is usually possible
to eliminate many of the items in each basket as not able to participate
in a frequent itemset, so the value of n drops as k increases.
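To see the growth concretely: the number of size-k subsets of an n-item basket is $\binom{n}{k}$, which for small k is roughly $n^k/k!$. A sketch using the standard library (the use of itertools here is my choice, not the text's):

```python
from itertools import combinations
from math import comb

basket = list(range(20))                   # n = 20 items
for k in (2, 3, 4):
    subsets = list(combinations(basket, k))
    # comb(20, k) counts C(n, k); for n much larger than k
    # this is approximately n**k / k!
    print(k, len(subsets), comb(20, k))
```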
The conclusion we would like to draw is that the work of examining each of
the baskets can usually be assumed proportional to the size of the file. We can
thus measure the running time of a frequent-itemset algorithm by the number
of times that each disk block of the data file is read.