Mining Data Streams - Mining of Massive Datasets

by d/100 + s/10 + 18d/100. This ratio is d/(10s + 19d). For no positive values

of s and d is d/(s + d) = d/(10s + 19d).
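
The last claim can be checked by cross-multiplying. The short derivation below assumes only that s and d are positive counts; their definitions come from Section 4.2.1 and are not repeated here.

```latex
\frac{d}{s+d} = \frac{d}{10s+19d}
\;\Longrightarrow\; s + d = 10s + 19d
\;\Longrightarrow\; 9s + 18d = 0
```

The final equation cannot hold when s > 0 and d > 0, so the ratio computed from the sample never equals d/(s + d).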

4.2.2 Obtaining a Representative Sample

The query of Section 4.2.1, like many queries about the statistics of typical

users, cannot be answered by taking a sample of each user's search queries.

Thus, we must strive to pick 1/10th of the users, and take all their searches for

the sample, while taking none of the searches from other users. If we can store

a list of all users, and whether or not they are in the sample, then we could

do the following. Each time a search query arrives in the stream, we look up

the user to see whether or not they are in the sample. If so, we add this search

query to the sample, and if not, then not. However, if we have no record of

ever having seen this user before, then we generate a random integer between

0 and 9. If the number is 0, we add this user to our list with value “in,” and if

the number is other than 0, we add the user with the value “out.”
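
The stored-list procedure just described might be sketched in Python as follows. This is a minimal illustration, not code from the text: the function name, the dictionary `decision`, and the (user, query, time) tuple layout are assumptions made for the example.

```python
import random

def sample_with_stored_decisions(stream, rng=None):
    """Yield every tuple belonging to a user chosen for the 1/10 sample.

    `stream` is assumed to be an iterable of (user, query, time) tuples.
    """
    rng = rng or random.Random()
    decision = {}  # hypothetical in-memory list: user -> True ("in") / False ("out")
    for user, query, time in stream:
        if user not in decision:
            # First sighting of this user: draw a random integer from 0 to 9
            # and mark the user "in" only if the draw is 0.
            decision[user] = (rng.randint(0, 9) == 0)
        if decision[user]:
            yield (user, query, time)
```

Because the decision is made once per user and then looked up, a sampled user contributes all of their searches and an unsampled user contributes none. The cost is that `decision` grows with the number of distinct users.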

That method works as long as we can afford to keep the list of all users and

their in/out decision in main memory, because there isn't time to go to disk for

every search that arrives. By using a hash function, one can avoid keeping the

list of users. That is, we hash each user name to one of ten buckets, 0 through

9. If the user hashes to bucket 0, then accept this search query for the sample,

and if not, then not.

Note we do not actually store the user in the bucket; in fact, there is no data

in the buckets at all. Effectively, we use the hash function as a random-number

generator, with the important property that, when applied to the same user

several times, we always get the same “random” number. That is, without

storing the in/out decision for any user, we can reconstruct that decision any

time a search query by that user arrives.
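
A minimal sketch of this stateless, hash-based decision is shown below. The choice of SHA-256 and the names `bucket` and `in_sample` are assumptions for illustration; the text does not prescribe a hash function. A cryptographic digest is used here instead of Python's built-in `hash()`, because the latter is salted per process for strings and would not give the same bucket across runs.

```python
import hashlib

def bucket(user, num_buckets=10):
    """Map a user name to a bucket number in 0..num_buckets-1."""
    digest = hashlib.sha256(user.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and reduce it to a bucket.
    return int.from_bytes(digest[:8], "big") % num_buckets

def in_sample(user):
    """Accept a search query only if its user hashes to bucket 0."""
    return bucket(user) == 0

# Usage: filter a stream without storing any per-user state.
stream = [("alice", "weather", 1), ("bob", "news", 2), ("alice", "maps", 3)]
sample = [t for t in stream if in_sample(t[0])]
```

Nothing is stored in the buckets; calling `bucket` again for the same user simply recomputes the same number.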

More generally, we can obtain a sample consisting of any rational fraction

a/b of the users by hashing user names to b buckets, 0 through b−1. Add the

search query to the sample if the hash value is less than a.
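
The generalization to a fraction a/b could look like the sketch below. The function name `accept` and the use of SHA-256 are assumptions for the example, not part of the original text.

```python
import hashlib

def accept(user, a, b):
    """Return True if `user` falls in an a/b sample of users.

    The user name is hashed to one of b buckets (0 through b-1) and
    accepted when the bucket number is less than a.
    """
    if b <= 0 or not 0 <= a <= b:
        raise ValueError("need b > 0 and 0 <= a <= b")
    digest = hashlib.sha256(user.encode("utf-8")).digest()
    h = int.from_bytes(digest[:8], "big") % b
    return h < a
```

For example, `accept(user, 3, 10)` would keep roughly three tenths of the users, provided the hash spreads user names evenly over the buckets.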

4.2.3 The General Sampling Problem

The running example is typical of the following general problem. Our stream

consists of tuples with n components. A subset of the components are the key

components, on which the selection of the sample will be based. In our running

example, there are three components - user, query, and time - of which only

user is in the key. However, we could also take a sample of queries by making

query be the key, or even take a sample of user-query pairs by making both

those components form the key.

To take a sample of size a/b, we hash the key value for each tuple to b

buckets, and accept the tuple for the sample if the hash value is less than a.

If the key consists of more than one component, the hash function needs to

combine the values for those components to make a single hash-value.
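
One way to combine several key components into a single hash value is sketched below. The names `key_hash` and `sample_by_key`, the length-prefixed encoding, and the use of SHA-256 are all assumptions made for illustration; any hash that depends on every key component would serve.

```python
import hashlib

def key_hash(record, key_indexes, b):
    """Hash the key components of `record` to a bucket in 0..b-1."""
    parts = [str(record[i]) for i in key_indexes]
    # Length-prefix each component so that, e.g., ("ab", "c") and
    # ("a", "bc") are encoded differently before hashing.
    encoded = "".join(f"{len(p)}:{p}" for p in parts).encode("utf-8")
    digest = hashlib.sha256(encoded).digest()
    return int.from_bytes(digest[:8], "big") % b

def sample_by_key(stream, key_indexes, a, b):
    """Keep the tuples whose key hashes to a bucket number below a."""
    return [t for t in stream if key_hash(t, key_indexes, b) < a]

# Usage with (user, query, time) tuples: key on user alone, or on the
# user-query pair.
stream = [("alice", "weather", 1), ("bob", "news", 2), ("alice", "weather", 3)]
by_user = sample_by_key(stream, key_indexes=[0], a=1, b=10)
by_user_query = sample_by_key(stream, key_indexes=[0, 1], a=1, b=10)
```

Tuples that agree on every key component always receive the same decision, whatever their non-key components contain.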

The
