Database Reference
In-Depth Information
item until the item appears in the stream, and the first time it appears, its score is only 1
(since 1, or (1 − c ) 0 , is the weight of the current item).
If we wish to adapt this method to streams of baskets, there are two modifications we
must make. The first is simple. Stream elements are baskets rather than individual items,
so many items may appear at a given stream element. Treat each of those items as if they
were the “current” item and add 1 to their score after multiplying all current scores by 1− c ,
as described in Section 4.7.3 . If some items in a basket have no current score, initialize the
scores of those items to 1.
The second modification is trickier. We want to find all frequent itemsets, not just
singleton itemsets. If we were to initialize a count for an itemset whenever we saw it, we
would have too many counts. For example, one basket of 20 items has over a million sub-
sets, and all of these would have to be initiated for one basket. On the other hand, as we
mentioned, if we use a requirement above 1 for initiating the scoring of an itemset, then we
would never get any itemsets started, and the method would not work.
A way of dealing with this problem is to start scoring certain itemsets as soon as we see
one instance, but be conservative about which itemsets we start. We may borrow from the
A-Priori trick, and only start an itemset I if all its immediate proper subsets are already be-
ing scored. The consequence of this restriction is that if I is truly frequent, eventually we
shall begin to count it, but we never start an itemset unless it would at least be a candidate
in the sense used in the A-Priori Algorithm.
EXAMPLE 6.12 Suppose I is a large itemset, but it appears in the stream periodically, once
every 2/ c baskets. Then its score, and that of its subsets, never falls below e −1/2 , which is
greater than 1/2. Thus, once a score is created for some subset of I , that subset will contin-
ue to be scored forever. The first time I appears, only its singleton subsets will have scores
created for them. However, the next time I appears, each of its doubleton subsets will com-
mence scoring, since each of the immediate subsets of those doubletons is already being
scored. Likewise, the k th time I appears, its subsets of size k − 1 are all being scored, so
we initiate scores for each of its subsets of size k . Eventually, we reach the size | I |, at which
time we start scoring I itself.
6.5.3
Hybrid Methods
The approach of Section 6.5.2 offers certain advantages. It requires a limited amount of
work each time a stream element arrives, and it always provides an upto-date picture of
what is frequent in the decaying window. Its big disadvantage is that it requires us to main-
tain scores for each itemset with a score of at least 1/2. We can limit the number of itemsets
being scored by increasing the value of the parameter c . But the larger c is, the smaller the
Search WWH ::




Custom Search