Database Reference
In-Depth Information
appearing twice in the sample is d /100 divided by d /100 + s /10 + 18 d /100. This ratio is
d /(10 s + 19 d ). For no positive values of s and d is d /( s + d ) = d /(10 s + 19 d ).
4.2.2
Obtaining a Representative Sample
The query of Section 4.2.1 , like many queries about the statistics of typical users, cannot
be answered by taking a sample of each user's search queries. Thus, we must strive to pick
1/10th of the users, and take all their searches for the sample, while taking none of the
searches from other users. If we can store a list of all users, and whether or not they are in
the sample, then we could do the following. Each time a search query arrives in the stream,
we look up the user to see whether or not they are in the sample. If so, we add this search
query to the sample, and if not, then not. However, if we have no record of ever having
seen this user before, then we generate a random integer between 0 and 9. If the number is
0, we add this user to our list with value “in,” and if the number is other than 0, we add the
user with the value “out.”
That method works as long as we can afford to keep the list of all users and their in/
out decision in main memory, because there isn't time to go to disk for every search that
arrives. By using a hash function, one can avoid keeping the list of users. That is, we hash
each user name to one of ten buckets, 0 through 9. If the user hashes to bucket 0, then ac-
cept this search query for the sample, and if not, then not.
Note we do not actually store the user in the bucket; in fact, there is no data in the buckets
at all. Effectively, we use the hash function as a random-number generator, with the im-
portant property that, when applied to the same user several times, we always get the same
“'random” number. That is, without storing the in/out decision for any user, we can recon-
struct that decision any time a search query by that user arrives.
More generally, we can obtain a sample consisting of any rational fraction a / b of the
users by hashing user names to b buckets, 0 through b − 1. Add the search query to the
sample if the hash value is less than a .
4.2.3
The General Sampling Problem
The running example is typical of the following general problem. Our stream consists of
tuples with n components. A subset of the components are the key components, on which
the selection of the sample will be based. In our running example, there are three compon-
ents - user, query, and time - of which only user is in the key. However, we could also take
a sample of queries by making query be the key, or even take a sample of user-query pairs
by making both those components form the key.
Search WWH ::




Custom Search