Frequent Itemsets - Mining of Massive Datasets

Database Reference

In-Depth Information

item until the item appears in the stream, and the first time it appears, its score is only 1

(since 1, or (1 − c ) 0 , is the weight of the current item).

If we wish to adapt this method to streams of baskets, there are two modifications we

must make. The first is simple. Stream elements are baskets rather than individual items,

so many items may appear at a given stream element. Treat each of those items as if they

were the “current” item and add 1 to their score after multiplying all current scores by 1− c ,

as described in Section 4.7.3 . If some items in a basket have no current score, initialize the

scores of those items to 1.

The second modification is trickier. We want to find all frequent itemsets, not just

singleton itemsets. If we were to initialize a count for an itemset whenever we saw it, we

would have too many counts. For example, one basket of 20 items has over a million sub-

sets, and all of these would have to be initiated for one basket. On the other hand, as we

mentioned, if we use a requirement above 1 for initiating the scoring of an itemset, then we

would never get any itemsets started, and the method would not work.

A way of dealing with this problem is to start scoring certain itemsets as soon as we see

one instance, but be conservative about which itemsets we start. We may borrow from the

A-Priori trick, and only start an itemset I if all its immediate proper subsets are already be-

ing scored. The consequence of this restriction is that if I is truly frequent, eventually we

shall begin to count it, but we never start an itemset unless it would at least be a candidate

in the sense used in the A-Priori Algorithm.

EXAMPLE 6.12 Suppose I is a large itemset, but it appears in the stream periodically, once

every 2/ c baskets. Then its score, and that of its subsets, never falls below e −1/2 , which is

greater than 1/2. Thus, once a score is created for some subset of I , that subset will contin-

ue to be scored forever. The first time I appears, only its singleton subsets will have scores

created for them. However, the next time I appears, each of its doubleton subsets will com-

mence scoring, since each of the immediate subsets of those doubletons is already being

scored. Likewise, the k th time I appears, its subsets of size k − 1 are all being scored, so

we initiate scores for each of its subsets of size k . Eventually, we reach the size | I |, at which

time we start scoring I itself.

□

6.5.3

Hybrid Methods

The approach of Section 6.5.2 offers certain advantages. It requires a limited amount of

work each time a stream element arrives, and it always provides an upto-date picture of

what is frequent in the decaying window. Its big disadvantage is that it requires us to main-

tain scores for each itemset with a score of at least 1/2. We can limit the number of itemsets

being scored by increasing the value of the parameter c . But the larger c is, the smaller the

Search WWH ::

Custom Search

Home