Databases Reference
In-Depth Information
We can combine the ideas from Sections 6.5.1 and 6.5.2. For example, we
could run a standard algorithm for frequent itemsets on a sample of the stream,
with a conventional threshold for support. The itemsets that are frequent by
this standard will be treated as if they all arrived at the current time. That is,
they each get a score equal to a fixed fraction of their count.
More precisely, suppose the initial sample has b baskets, c is the decay
constant for the decaying window, and the minimum score we wish to accept
for a frequent itemset in the decaying window is s. Then the support threshold
for the initial run of the frequent-itemset algorithm is bcs. If an itemset I is
found to have support t in the sample, then it is initially given a score of t/(bc).
Example 6.13 : Suppose c = 10 −6 and the minimum score we wish to accept
in the decaying window is 10. Suppose also we take a sample of 10 8 baskets
from the stream. Then when analyzing that sample, we use a support threshold
of 10 8 ×10 −6 ×10 = 1000.
Consider an itemset I that has support 2000 in the sample. Then the initial
score we use for I is 2000/(10 8 ×10 −6 ) = 20. After this initiation step, each
time a basket arrives in the stream, the current score will be multiplied by
1−c = 0.999999. If I is a subset of the current basket, then add 1 to the score.
If the score for I goes below 10, then it is considered to be no longer frequent,
so it is dropped from the collection of frequent itemsets.
2
We do not, sadly, have a reasonable way of initiating the scoring of new
itemsets. If we have no score for itemset I, and 10 is the minimum score we
want to maintain, there is no way that a single basket can jump its score from
0 to anything more than 1. The best strategy for adding new sets is to run a
new frequent-itemsets calculation on a sample from the stream, and add to the
collection of itemsets being scored any that meet the threshold for that sample
but were not previously being scored.
6.5.4
Exercises for Section 6.5
!! Exercise 6.5.1 : Suppose we are counting frequent itemsets in a decaying win-
dow with a decay constant c. Suppose also that with probability p, a given
stream element (basket) contains both items i and j. Additionally, with prob-
ability p the basket contains i but not j, and with probability p it contains j
but not i. As a function of c and p, what is the fraction of time we shall be
scoring the pair{i, j}?
6.6
Summary of Chapter 6
F
Market-Basket Data: This model of data assumes there are two kinds of
entities: items and baskets. There is a many-many relationship between
items and baskets. Typically, baskets are related to small sets of items,
while items may be related to many baskets.
Search WWH ::




Custom Search