6.2.1 Representation of Market-Basket Data
As we mentioned, we assume that market-basket data is stored in a file basket-
by-basket. Possibly, the data is in a distributed file system as in Section 2.1,
and the baskets are the objects the file contains. Or the data may be stored
in a conventional file, with a character code to represent the baskets and their
items.
Example 6.4 : We could imagine that such a file begins:

{23,456,1001}{3,18,92,145}{...

Here, the character { begins a basket and the character } ends it. The items in
a basket are represented by integers and are separated by commas. Thus, the
first basket contains items 23, 456, and 1001; the second basket contains items
3, 18, 92, and 145. □
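A minimal sketch of parsing baskets in this character-coded format (the function name and the use of Python are illustrative assumptions, not part of the text):

```python
def parse_baskets(text):
    """Split a string like "{23,456,1001}{3,18,92,145}" into lists of item IDs.

    Each basket is delimited by { and }; items are comma-separated integers.
    """
    baskets = []
    for chunk in text.split("}"):          # each chunk is one "{...}" basket
        chunk = chunk.strip().lstrip("{")  # drop the opening brace
        if chunk:
            baskets.append([int(item) for item in chunk.split(",")])
    return baskets

print(parse_baskets("{23,456,1001}{3,18,92,145}"))
# → [[23, 456, 1001], [3, 18, 92, 145]]
```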
It may be that one machine receives the entire file. Or we could be using
map-reduce or a similar tool to divide the work among many processors, in
which case each processor receives only a part of the file. It turns out that
combining the work of parallel processors to get the exact collection of itemsets
that meet a global support threshold is hard, and we shall address this question
only in Section 6.4.4.
We also assume that the size of the file of baskets is sufficiently large that it
does not fit in main memory. Thus, a major cost of any algorithm is the time
it takes to read the baskets from disk. Once a disk block full of baskets is in
main memory, we can expand it, generating all the subsets of size k. Since one
of the assumptions of our model is that the average size of a basket is small,
generating all the pairs in main memory should take time that is much less
than the time it took to read the basket from disk. For example, if there are 20
items in a basket, then there are $\binom{20}{2} = 190$ pairs of items in the basket, and
these can be generated easily in a pair of nested for-loops.
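The pair of nested for-loops can be sketched as follows (Python is my choice here; the text itself gives no code):

```python
def all_pairs(basket):
    """Generate every unordered pair of distinct items in one basket."""
    pairs = []
    for i in range(len(basket)):
        for j in range(i + 1, len(basket)):   # j > i avoids duplicates
            pairs.append((basket[i], basket[j]))
    return pairs

basket = list(range(20))       # a basket of 20 items
print(len(all_pairs(basket)))  # → 190, i.e. C(20, 2)
```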
As the size of the subsets we want to generate gets larger, the time required
grows; in fact, it takes approximately time $n^k/k!$ to generate all the subsets
of size k for a basket with n items. Eventually, this time dominates the time
needed to transfer the data from disk. However:
1. Often, we need only small frequent itemsets, so k never grows beyond 2
or 3.
2. And when we do need the itemsets for a large size k, it is usually possible
to eliminate many of the items in each basket as not able to participate
in a frequent itemset, so the value of n drops as k increases.
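To see the growth concretely: the number of size-k subsets of an n-item basket is $\binom{n}{k}$, which for small k is roughly $n^k/k!$. A sketch using the standard library (the use of itertools here is my choice, not the text's):

```python
from itertools import combinations
from math import comb

basket = list(range(20))                   # n = 20 items
for k in (2, 3, 4):
    subsets = list(combinations(basket, k))
    # comb(20, k) counts C(n, k); for n much larger than k
    # this is approximately n**k / k!
    print(k, len(subsets), comb(20, k))
```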
The conclusion we would like to draw is that the work of examining each of
the baskets can usually be assumed proportional to the size of the file. We can
thus measure the running time of a frequent-itemset algorithm by the number
of times that each disk block of the data file is read.