6.2.1 Representation of Market-Basket Data
As we mentioned, we assume that market-basket data is stored in a file basket-by-basket.
Possibly, the data is in a distributed file system as in Section 2.1, and the baskets are the
objects the file contains. Or the data may be stored in a conventional file, with a character
code to represent the baskets and their items.
EXAMPLE 6.4 We could imagine that such a file begins:
{23,456,1001}{3,18,92,145}{...
Here, the character { begins a basket and the character } ends it. The items in a basket are
represented by integers and are separated by commas. Thus, the first basket contains items
23, 456, and 1001; the second basket contains items 3, 18, 92, and 145.
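To make the representation concrete, here is a minimal sketch in Python of how a string in this character code might be decoded into baskets of integer item IDs; the helper name parse_baskets is ours, not from the text:

def parse_baskets(text):
    # Split on '}' so each chunk is one basket; strip the leading '{'
    # and convert the comma-separated items to integers.
    baskets = []
    for chunk in text.split('}'):
        chunk = chunk.strip().lstrip('{')
        if chunk:
            baskets.append([int(item) for item in chunk.split(',')])
    return baskets

print(parse_baskets('{23,456,1001}{3,18,92,145}'))
# [[23, 456, 1001], [3, 18, 92, 145]]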
It may be that one machine receives the entire file. Or we could be using MapReduce
or a similar tool to divide the work among many processors, in which case each processor
receives only a part of the file. It turns out that combining the work of parallel processors
to get the exact collection of itemsets that meet a global support threshold is hard, and we
shall address this question only in Section 6.4.4.
We also assume that the size of the file of baskets is sufficiently large that it does not fit
in main memory. Thus, a major cost of any algorithm is the time it takes to read the baskets
from disk. Once a disk block full of baskets is in main memory, we can expand it, generating all the subsets of size k. Since one of the assumptions of our model is that the average size of a basket is small, generating all the pairs in main memory should take time that is much less than the time it took to read the basket from disk. For example, if there are 20 items in a basket, then there are $\binom{20}{2} = 190$ pairs of items in the basket, and these can be generated easily in a pair of nested for-loops.
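A minimal sketch of this double loop, assuming each basket is a Python list of item IDs as in Example 6.4:

def all_pairs(basket):
    # Two nested for-loops enumerate each unordered pair exactly once.
    pairs = []
    n = len(basket)
    for i in range(n):
        for j in range(i + 1, n):
            pairs.append((basket[i], basket[j]))
    return pairs

# A 20-item basket yields 20*19/2 = 190 pairs.
assert len(all_pairs(list(range(20)))) == 190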
As the size of the subsets we want to generate gets larger, the time required grows; in fact, it takes approximately time $n^k/k!$ to generate all the subsets of size k for a basket with n items. Eventually, this time dominates the time needed to transfer the data from disk.
However:
(1) Often, we need only small frequent itemsets, so k never grows beyond 2 or 3.
(2) When we do need the itemsets for a large size k, it is usually possible to eliminate
many of the items in each basket as unable to participate in a frequent itemset, so the
value of n drops as k increases.
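For larger k, subset generation can be sketched as follows, using Python's standard itertools.combinations; the number of such subsets is $\binom{n}{k}$, which is roughly $n^k/k!$ when k is small relative to n:

from itertools import combinations

def subsets_of_size_k(basket, k):
    # Enumerates all C(n, k) size-k subsets of an n-item basket.
    return list(combinations(basket, k))

print(subsets_of_size_k([3, 18, 92, 145], 3))
# [(3, 18, 92), (3, 18, 145), (3, 92, 145), (18, 92, 145)]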
The conclusion we would like to draw is that the work of examining each of the baskets can usually be assumed proportional to the size of the file. We can thus measure the running time of a frequent-itemset algorithm by the number of times that each disk block of the data file is read.