Mining Data Streams - Mining of Massive Datasets

Databases Reference

In-Depth Information

result will be a sample consisting of all tuples with certain key values. The

selected key values will be approximately a/b of all the key values appearing in

the stream.

4.2.4

Varying the Sample Size

Often, the sample will grow as more of the stream enters the system. In our

running example, we retain all the search queries of the selected 1/10th of

the users, forever. As time goes on, more searches for the same users will be

accumulated, and new users that are selected for the sample will appear in the

stream.

If we have a budget for how many tuples from the stream can be stored as

the sample, then the fraction of key values must vary, lowering as time goes

on. In order to assure that at all times, the sample consists of all tuples from

a subset of the key values, we choose a hash function h from key values to a

very large number of values 0, 1, . . . , B−1. We maintain a threshold t, which

initially can be the largest bucket number, B−1. At all times, the sample

consists of those tuples whose key K satisfies h(K)≤t. New tuples from the

stream are added to the sample if and only if they satisfy the same condition.

If the number of stored tuples of the sample exceeds the allotted space, we

lower t to t−1 and remove from the sample all those tuples whose key K hashes

to t. For e ciency, we can lower t by more than 1, and remove the tuples with

several of the highest hash values, whenever we need to throw some key values

out of the sample. Further e ciency is obtained by maintaining an index on

the hash value, so we can find all those tuples whose keys hash to a particular

value quickly.

4.2.5

Exercises for Section 4.2

Exercise 4.2.1 : Suppose we have a stream of tuples with the schema

Grades(university, courseID, studentID, grade)

Assume universities are unique, but a courseID is unique only within a uni-

versity (i.e., different universities may have different courses with the same ID,

e.g., “CS101”) and likewise, studentID's are unique only within a university

(different universities may assign the same ID to different students). Suppose

we want to answer certain queries approximately from a 1/20th sample of the

data. For each of the queries below, indicate how you would construct the

sample. That is, tell what the key attributes should be.

(a) For each university, estimate the average number of students in a course.

(b) Estimate the fraction of students who have a GPA of 3.5 or more.

(c) Estimate the fraction of courses where at least half the students got “A.”

Search WWH ::

Custom Search

Home