Databases Reference
In-Depth Information
(b) What is the negative border?
(c) What is the outcome of the pass through the full dataset? Are any of the
itemsets in the negative border frequent in the whole?
!! Exercise 6.4.3 : Suppose item i appears exactly s times in a file of n baskets,
where s is the support threshold. If we take a sample of n/100 baskets, and
lower the support threshold for the sample to s/100, what is the probability
that i will be found to be frequent? You may assume that both s and n are
divisible by 100.
6.5
Counting Frequent Items in a Stream
Suppose that instead of a file of baskets we have a stream of baskets, from
which we want to mine the frequent itemsets. Recall from Chapter 4 that the
difference between a stream and a data file is that stream elements are only
available when they arrive, and typically the arrival rate is so great that we
cannot store the entire stream in a way that allows easy querying. Further, it
is common that streams evolve over time, so the itemsets that are frequent in
today's stream may not be frequent tomorrow.
A clear distinction between streams and files, when frequent itemsets are
considered, is that there is no end to a stream, so eventually an itemset is going
to exceed the support threshold, as long as it appears repeatedly in the stream.
As a result, for streams, we must think of the support threshold s as a fraction of
the baskets in which an itemset must appear in order to be considered frequent.
Even with this adjustment, we still have several options regarding the portion
of the stream over which that fraction is measured.
In this section, we shall discuss several ways that we might extract frequent
itemsets from a stream. First, we consider ways to use the sampling tech-
niques of the previous section. Then, we consider the decaying-window model
from Section 4.7, and extend the method described in Section 4.7.3 for finding
“popular” items.
6.5.1 Sampling Methods for Streams
In what follows, we shall assume that stream elements are baskets of items.
Perhaps the simplest approach to maintaining a current estimate of the frequent
itemsets in a stream is to collect some number of baskets and store it as a file.
Run one of the frequent-itemset algorithms discussed in this chapter, meanwhile
ignoring the stream elements that arrive, or storing them as another file to be
analyzed later. When the frequent-itemsets algorithm finishes, we have an
estimate of the frequent itemsets in the stream. We then have several options.
1. We can use this collection of frequent itemsets for whatever application is
at hand, but start running another iteration of the chosen frequent-itemset
algorithm immediately. This algorithm can either:
Search WWH ::




Custom Search