Database Reference
In-Depth Information
4.2 Sampling Data in a Stream
As our first example of managing streaming data, we shall look at extracting reliable
samples from a stream. As with many stream algorithms, the “trick” involves using hashing
in a somewhat unusual way.
4.2.1
A Motivating Example
The general problem we shall address is selecting a subset of a stream so that we can ask
queries about the selected subset and have the answers be statistically representative of the
stream as a whole. If we know what queries are to be asked, then there are a number of
methods that might work, but we are looking for a technique that will allow ad-hoc quer-
ies on the sample. We shall look at a particular problem, from which the general idea will
emerge.
Our running example is the following. A search engine receives a stream of queries,
and it would like to study the behavior of typical users. 1 We assume the stream consists of
tuples (user, query, time). Suppose that we want to answer queries such as “What fraction
of the typical user's queries were repeated over the past month?” Assume also that we wish
to store only 1/10th of the stream elements.
The obvious approach would be to generate a random number, say an integer from 0 to
9, in response to each search query. Store the tuple if and only if the random number is 0. If
we do so, each user has, on average, 1/10th of their queries stored. Statistical fluctuations
will introduce some noise into the data, but if users issue many queries, the law of large
numbers will assure us that most users will have a fraction quite close to 1/10th of their
queries stored.
However, this scheme gives us the wrong answer to the query asking for the average
number of duplicate queries for a user. Suppose a user has issued s search queries one time
in the past month, d search queries twice, and no search queries more than twice. If we
have a 1/10th sample, of queries, we shall see in the sample for that user an expected s /10
of the search queries issued once. Of the d search queries issued twice, only d /100 will ap-
pear twice in the sample; that fraction is d times the probability that both occurrences of
the query will be in the 1/10th sample. Of the queries that appear twice in the full stream,
18 d /100 will appear exactly once. To see why, note that 18/100 is the probability that one
of the two occurrences will be in the 1/10th of the stream that is selected, while the other is
in the 9/10th that is not selected.
The correct answer to the query about the fraction of repeated searches is d /( s + d ).
However, the answer we shall obtain from the sample is d /(10 s + 19 d ). To derive the latter
formula, note that d /100 appear twice, while s /10+18 d /100 appear once. Thus, the fraction
Search WWH ::




Custom Search