Mining Data Streams - Mining of Massive Datasets

Database Reference

In-Depth Information

4.2 Sampling Data in a Stream

As our first example of managing streaming data, we shall look at extracting reliable

samples from a stream. As with many stream algorithms, the “trick” involves using hashing

in a somewhat unusual way.

4.2.1

A Motivating Example

The general problem we shall address is selecting a subset of a stream so that we can ask

queries about the selected subset and have the answers be statistically representative of the

stream as a whole. If we know what queries are to be asked, then there are a number of

methods that might work, but we are looking for a technique that will allow ad-hoc quer-

ies on the sample. We shall look at a particular problem, from which the general idea will

emerge.

Our running example is the following. A search engine receives a stream of queries,

and it would like to study the behavior of typical users. 1 We assume the stream consists of

tuples (user, query, time). Suppose that we want to answer queries such as “What fraction

of the typical user's queries were repeated over the past month?” Assume also that we wish

to store only 1/10th of the stream elements.

The obvious approach would be to generate a random number, say an integer from 0 to

9, in response to each search query. Store the tuple if and only if the random number is 0. If

we do so, each user has, on average, 1/10th of their queries stored. Statistical fluctuations

will introduce some noise into the data, but if users issue many queries, the law of large

numbers will assure us that most users will have a fraction quite close to 1/10th of their

queries stored.

However, this scheme gives us the wrong answer to the query asking for the average

number of duplicate queries for a user. Suppose a user has issued s search queries one time

in the past month, d search queries twice, and no search queries more than twice. If we

have a 1/10th sample, of queries, we shall see in the sample for that user an expected s /10

of the search queries issued once. Of the d search queries issued twice, only d /100 will ap-

pear twice in the sample; that fraction is d times the probability that both occurrences of

the query will be in the 1/10th sample. Of the queries that appear twice in the full stream,

18 d /100 will appear exactly once. To see why, note that 18/100 is the probability that one

of the two occurrences will be in the 1/10th of the stream that is selected, while the other is

in the 9/10th that is not selected.

The correct answer to the query about the fraction of repeated searches is d /( s + d ).

However, the answer we shall obtain from the sample is d /(10 s + 19 d ). To derive the latter

formula, note that d /100 appear twice, while s /10+18 d /100 appear once. Thus, the fraction

Search WWH ::

Custom Search

Home