Mining Data Streams - Mining of Massive Datasets

Database Reference

In-Depth Information

To take a sample of size a / b , we hash the key value for each tuple to b buckets, and ac-

cept the tuple for the sample if the hash value is less than a . If the key consists of more

than one component, the hash function needs to combine the values for those components

to make a single hash-value. The result will be a sample consisting of all tuples with certain

key values. The selected key values will be approximately a / b of all the key values appear-

ing in the stream.

4.2.4

Varying the Sample Size

Often, the sample will grow as more of the stream enters the system. In our running ex-

ample, we retain all the search queries of the selected 1/10th of the users, forever. As time

goes on, more searches for the same users will be accumulated, and new users that are se-

lected for the sample will appear in the stream.

If we have a budget for how many tuples from the stream can be stored as the sample,

then the fraction of key values must vary, lowering as time goes on. In order to assure that

at all times, the sample consists of all tuples from a subset of the key values, we choose a

hash function h from key values to a very large number of values 0 , 1 , . . . , B − 1. We main-

tain a threshold t , which initially can be the largest bucket number, B − 1. At all times, the

sample consists of those tuples whose key K satisfies h ( K ) ≤ t . New tuples from the stream

are added to the sample if and only if they satisfy the same condition.

If the number of stored tuples of the sample exceeds the allotted space, we lower t to t

− 1 and remove from the sample all those tuples whose key K hashes to t . For efficiency,

we can lower t by more than 1, and remove the tuples with several of the highest hash val-

ues, whenever we need to throw some key values out of the sample. Further efficiency is

obtained by maintaining an index on the hash value, so we can find all those tuples whose

keys hash to a particular value quickly.

4.2.5

Exercises for Section 4.2

EXERCISE 4.2.1 Suppose we have a stream of tuples with the schema

Grades(university, courseID, studentID, grade)

Assume universities are unique, but a courseID is unique only within a university (i.e., dif-

ferent universities may have different courses with the same ID, e.g., “CS101”) and like-

wise, studentID's are unique only within a university (different universities may assign the

same ID to different students). Suppose we want to answer certain queries approximately

Search WWH ::

Custom Search

Home