Frequent Itemsets - Mining of Massive Datasets

Databases Reference

In-Depth Information

(b) What is the negative border?

(c) What is the outcome of the pass through the full dataset? Are any of the

itemsets in the negative border frequent in the whole?

!! Exercise 6.4.3 : Suppose item i appears exactly s times in a file of n baskets,

where s is the support threshold. If we take a sample of n/100 baskets, and

lower the support threshold for the sample to s/100, what is the probability

that i will be found to be frequent? You may assume that both s and n are

divisible by 100.

6.5

Counting Frequent Items in a Stream

Suppose that instead of a file of baskets we have a stream of baskets, from

which we want to mine the frequent itemsets. Recall from Chapter 4 that the

difference between a stream and a data file is that stream elements are only

available when they arrive, and typically the arrival rate is so great that we

cannot store the entire stream in a way that allows easy querying. Further, it

is common that streams evolve over time, so the itemsets that are frequent in

today's stream may not be frequent tomorrow.

A clear distinction between streams and files, when frequent itemsets are

considered, is that there is no end to a stream, so eventually an itemset is going

to exceed the support threshold, as long as it appears repeatedly in the stream.

As a result, for streams, we must think of the support threshold s as a fraction of

the baskets in which an itemset must appear in order to be considered frequent.

Even with this adjustment, we still have several options regarding the portion

of the stream over which that fraction is measured.

In this section, we shall discuss several ways that we might extract frequent

itemsets from a stream. First, we consider ways to use the sampling tech-

niques of the previous section. Then, we consider the decaying-window model

from Section 4.7, and extend the method described in Section 4.7.3 for finding

“popular” items.

6.5.1 Sampling Methods for Streams

In what follows, we shall assume that stream elements are baskets of items.

Perhaps the simplest approach to maintaining a current estimate of the frequent

itemsets in a stream is to collect some number of baskets and store it as a file.

Run one of the frequent-itemset algorithms discussed in this chapter, meanwhile

ignoring the stream elements that arrive, or storing them as another file to be

analyzed later. When the frequent-itemsets algorithm finishes, we have an

estimate of the frequent itemsets in the stream. We then have several options.

1. We can use this collection of frequent itemsets for whatever application is

at hand, but start running another iteration of the chosen frequent-itemset

algorithm immediately. This algorithm can either:

Search WWH ::

Custom Search

Home