Frequent Itemsets - Mining of Massive Datasets

Databases Reference

In-Depth Information

time it appears, its score is only 1 (since 1, or (1−c) 0 , is the weight of the

current item).

If we wish to adapt this method to streams of baskets, there are two modifi-

cations we must make. The first is simple. Stream elements are baskets rather

than individual items, so many items may appear at a given stream element.

Treat each of those items as if they were the “current” item and add 1 to their

score after multiplying all current scores by 1−c, as described in Section 4.7.3.

If some items in a basket have no current score, initialize the scores of those

items to 1.

The second modification is trickier. We want to find all frequent itemsets,

not just singleton itemsets. If we were to initialize a count for an itemset

whenever we saw it, we would have too many counts. For example, one basket

of 20 items has over a million subsets, and all of these would have to be initiated

for one basket. On the other hand, as we mentioned, if we use a requirement

above 1 for initiating the scoring of an itemset, then we would never get any

itemsets started, and the method would not work.

A way of dealing with this problem is to start scoring certain itemsets as

soon as we see one instance, but be conservative about which itemsets we start.

We may borrow from the A-Priori trick, and only start an itemset I if all its

immediate proper subsets are already being scored. The consequence of this

restriction is that if I is truly frequent, eventually we shall begin to count it,

but we never start an itemset unless it would at least be a candidate in the

sense used in the A-Priori Algorithm.

Example 6.12 : Suppose I is a large itemset, but it appears in the stream

periodically, once every 2/c baskets. Then its score, and that of its subsets,

never falls below e −1/2 , which is greater than 1/2. Thus, once a score is created

for some subset of I, that subset will continue to be scored forever. The first

time I appears, only its singleton subsets will have scores created for them.

However, the next time I appears, each of its doubleton subsets will commence

scoring, since each of the immediate subsets of those doubletons is already being

scored. Likewise, the kth time I appears, its subsets of size k−1 are all being

scored, so we initiate scores for each of its subsets of size k.

Eventually, we

reach the size|I|, at which time we start scoring I itself.

2

6.5.3 Hybrid Methods

The approach of Section 6.5.2 offers certain advantages. It requires a limited

amount of work each time a stream element arrives, and it always provides

an up-to-date picture of what is frequent in the decaying window. Its big

disadvantage is that it requires us to maintain scores for each itemset with

a score of at least 1/2. We can limit the number of itemsets being scored

by increasing the value of the parameter c. But the larger c is, the smaller the

decaying window is. Thus, we could be forced to accept information that tracks

the local fluctuations in frequency too closely, rather than integrating over a

long period.

Mining of Massive Datasets

Search WWH ::

Custom Search

Home