Database Reference
In-Depth Information
for the sample to s /100, what is the probability that i will be found to be frequent? You may
assume that both s and n are divisible by 100.
6.5 Counting Frequent Items in a Stream
Suppose that instead of a file of baskets we have a stream of baskets, from which we want
to mine the frequent itemsets. Recall from Chapter 4 that the difference between a stream
and a data file is that stream elements are only available when they arrive, and typically
the arrival rate is so great that we cannot store the entire stream in a way that allows easy
querying. Further, it is common that streams evolve over time, so the itemsets that are fre-
quent in today's stream may not be frequent tomorrow.
A clear distinction between streams and files, when frequent itemsets are considered, is
that there is no end to a stream, so eventually an itemset is going to exceed the support
threshold, as long as it appears repeatedly in the stream. As a result, for streams, we must
think of the support threshold s as a fraction of the baskets in which an itemset must appear
in order to be considered frequent. Even with this adjustment, we still have several options
regarding the portion of the stream over which that fraction is measured.
In this section, we shall discuss several ways that we might extract frequent itemsets
from a stream. First, we consider ways to use the sampling techniques of the previous
section. Then, we consider the decaying-window model from Section 4.7 , and extend the
method described in Section 4.7.3 for finding “popular” items.
6.5.1
Sampling Methods for Streams
In what follows, we shall assume that stream elements are baskets of items. Perhaps the
simplest approach to maintaining a current estimate of the frequent itemsets in a stream is
to collect some number of baskets and store it as a file. Run one of the frequent-itemset
algorithms discussed in this chapter, meanwhile ignoring the stream elements that arrive,
or storing them as another file to be analyzed later. When the frequent-itemsets algorithm
finishes, we have an estimate of the frequent itemsets in the stream. We then have several
options.
(1) We can use this collection of frequent itemsets for whatever application is at hand, but
start running another iteration of the chosen frequent-itemset algorithm immediately.
This algorithm can either:
Search WWH ::




Custom Search