Mining Data Streams - Mining of Massive Datasets

Database Reference

In-Depth Information

appearing twice in the sample is d /100 divided by d /100 + s /10 + 18 d /100. This ratio is

d /(10 s + 19 d ). For no positive values of s and d is d /( s + d ) = d /(10 s + 19 d ).

4.2.2

Obtaining a Representative Sample

The query of Section 4.2.1 , like many queries about the statistics of typical users, cannot

be answered by taking a sample of each user's search queries. Thus, we must strive to pick

1/10th of the users, and take all their searches for the sample, while taking none of the

searches from other users. If we can store a list of all users, and whether or not they are in

the sample, then we could do the following. Each time a search query arrives in the stream,

we look up the user to see whether or not they are in the sample. If so, we add this search

query to the sample, and if not, then not. However, if we have no record of ever having

seen this user before, then we generate a random integer between 0 and 9. If the number is

0, we add this user to our list with value “in,” and if the number is other than 0, we add the

user with the value “out.”

That method works as long as we can afford to keep the list of all users and their in/

out decision in main memory, because there isn't time to go to disk for every search that

arrives. By using a hash function, one can avoid keeping the list of users. That is, we hash

each user name to one of ten buckets, 0 through 9. If the user hashes to bucket 0, then ac-

cept this search query for the sample, and if not, then not.

Note we do not actually store the user in the bucket; in fact, there is no data in the buckets

at all. Effectively, we use the hash function as a random-number generator, with the im-

portant property that, when applied to the same user several times, we always get the same

“'random” number. That is, without storing the in/out decision for any user, we can recon-

struct that decision any time a search query by that user arrives.

More generally, we can obtain a sample consisting of any rational fraction a / b of the

users by hashing user names to b buckets, 0 through b − 1. Add the search query to the

sample if the hash value is less than a .

4.2.3

The General Sampling Problem

The running example is typical of the following general problem. Our stream consists of

tuples with n components. A subset of the components are the key components, on which

the selection of the sample will be based. In our running example, there are three compon-

ents - user, query, and time - of which only user is in the key. However, we could also take

a sample of queries by making query be the key, or even take a sample of user-query pairs

by making both those components form the key.

Search WWH ::

Custom Search

Home