Databases Reference
In-Depth Information
time it appears, its score is only 1 (since 1, or (1−c) 0 , is the weight of the
current item).
If we wish to adapt this method to streams of baskets, there are two modifi-
cations we must make. The first is simple. Stream elements are baskets rather
than individual items, so many items may appear at a given stream element.
Treat each of those items as if they were the “current” item and add 1 to their
score after multiplying all current scores by 1−c, as described in Section 4.7.3.
If some items in a basket have no current score, initialize the scores of those
items to 1.
The second modification is trickier. We want to find all frequent itemsets,
not just singleton itemsets. If we were to initialize a count for an itemset
whenever we saw it, we would have too many counts. For example, one basket
of 20 items has over a million subsets, and all of these would have to be initiated
for one basket. On the other hand, as we mentioned, if we use a requirement
above 1 for initiating the scoring of an itemset, then we would never get any
itemsets started, and the method would not work.
A way of dealing with this problem is to start scoring certain itemsets as
soon as we see one instance, but be conservative about which itemsets we start.
We may borrow from the A-Priori trick, and only start an itemset I if all its
immediate proper subsets are already being scored. The consequence of this
restriction is that if I is truly frequent, eventually we shall begin to count it,
but we never start an itemset unless it would at least be a candidate in the
sense used in the A-Priori Algorithm.
Example 6.12 : Suppose I is a large itemset, but it appears in the stream
periodically, once every 2/c baskets. Then its score, and that of its subsets,
never falls below e −1/2 , which is greater than 1/2. Thus, once a score is created
for some subset of I, that subset will continue to be scored forever. The first
time I appears, only its singleton subsets will have scores created for them.
However, the next time I appears, each of its doubleton subsets will commence
scoring, since each of the immediate subsets of those doubletons is already being
scored. Likewise, the kth time I appears, its subsets of size k−1 are all being
scored, so we initiate scores for each of its subsets of size k.
Eventually, we
reach the size|I|, at which time we start scoring I itself.
2
6.5.3 Hybrid Methods
The approach of Section 6.5.2 offers certain advantages. It requires a limited
amount of work each time a stream element arrives, and it always provides
an up-to-date picture of what is frequent in the decaying window. Its big
disadvantage is that it requires us to maintain scores for each itemset with
a score of at least 1/2. We can limit the number of itemsets being scored
by increasing the value of the parameter c. But the larger c is, the smaller the
decaying window is. Thus, we could be forced to accept information that tracks
the local fluctuations in frequency too closely, rather than integrating over a
long period.
Search WWH ::




Custom Search