Database Reference
In-Depth Information
decaying window is. Thus, we could be forced to accept information that tracks the local
fluctuations in frequency too closely, rather than integrating over a long period.
We can combine the ideas from Sections 6.5.1 and 6.5.2 . For example, we could run a
standard algorithm for frequent itemsets on a sample of the stream, with a conventional
threshold for support. The itemsets that are found frequent by this algorithm will be treated
as if they all arrived at the current time. That is, they each get a score equal to a fixed frac-
tion of their count.
More precisely, suppose the initial sample has b baskets, c is the decay constant for the
decaying window, and the minimum score we wish to accept for a frequent itemset in the
decaying window is s . Then the support threshold for the initial run of the frequent-itemset
algorithm is bcs . If an itemset I is found to have support t in the sample, then it is initially
given a score of t /( bc ).
EXAMPLE 6.13 Suppose c = 10 −6 and the minimum score we wish to accept in the decaying
window is 10. Suppose also we take a sample of 10 8 baskets from the stream. Then when
analyzing that sample, we use a support threshold of 10 8 × 10 −6 × 10 = 1000.
Consider an itemset I that has support 2000 in the sample. Then the initial score we use
for I is 2000/(10 8 × 10 −6 ) = 20. After this initiation step, each time a basket arrives in the
stream, the current score will be multiplied by 1 − c = 0 . 999999. If I is a subset of the cur-
rent basket, then add 1 to the score. If the score for I goes below 10, then it is considered to
be no longer frequent, so it is dropped from the collection of frequent itemsets.
We do not, sadly, have a reasonable way of initiating the scoring of new itemsets. If we
have no score for itemset I , and 10 is the minimum score we want to maintain, there is
no way that a single basket can jump its score from 0 to anything more than 1. The best
strategy for adding new sets is to run a new frequent-itemsets calculation on a sample from
the stream, and add to the collection of itemsets being scored any that meet the threshold
for that sample but were not previously being scored.
6.5.4
Exercises for Section 6.5
!! EXERCISE 6.5.1 Suppose we are counting frequent itemsets in a decaying window with
a decay constant c . Suppose also that with probability p , a given stream element (basket)
contains both items i and j . Additionally, with probability p the basket contains i but not j ,
and with probability p it contains j but not i . As a function of c and p , what is the fraction
of time we shall be scoring the pair { i, j }?
Search WWH ::




Custom Search