Frequent Itemsets - Mining of Massive Datasets

Database Reference

In-Depth Information

for the sample to s /100, what is the probability that i will be found to be frequent? You may

assume that both s and n are divisible by 100.

6.5 Counting Frequent Items in a Stream

Suppose that instead of a file of baskets we have a stream of baskets, from which we want

to mine the frequent itemsets. Recall from Chapter 4 that the difference between a stream

and a data file is that stream elements are only available when they arrive, and typically

the arrival rate is so great that we cannot store the entire stream in a way that allows easy

querying. Further, it is common that streams evolve over time, so the itemsets that are fre-

quent in today's stream may not be frequent tomorrow.

A clear distinction between streams and files, when frequent itemsets are considered, is

that there is no end to a stream, so eventually an itemset is going to exceed the support

threshold, as long as it appears repeatedly in the stream. As a result, for streams, we must

think of the support threshold s as a fraction of the baskets in which an itemset must appear

in order to be considered frequent. Even with this adjustment, we still have several options

regarding the portion of the stream over which that fraction is measured.

In this section, we shall discuss several ways that we might extract frequent itemsets

from a stream. First, we consider ways to use the sampling techniques of the previous

section. Then, we consider the decaying-window model from Section 4.7 , and extend the

method described in Section 4.7.3 for finding “popular” items.

6.5.1

Sampling Methods for Streams

In what follows, we shall assume that stream elements are baskets of items. Perhaps the

simplest approach to maintaining a current estimate of the frequent itemsets in a stream is

to collect some number of baskets and store it as a file. Run one of the frequent-itemset

algorithms discussed in this chapter, meanwhile ignoring the stream elements that arrive,

or storing them as another file to be analyzed later. When the frequent-itemsets algorithm

finishes, we have an estimate of the frequent itemsets in the stream. We then have several

options.

(1) We can use this collection of frequent itemsets for whatever application is at hand, but

start running another iteration of the chosen frequent-itemset algorithm immediately.

This algorithm can either:

Search WWH ::

Custom Search

Home