Database Reference
In-Depth Information
To take a sample of size a / b , we hash the key value for each tuple to b buckets, and ac-
cept the tuple for the sample if the hash value is less than a . If the key consists of more
than one component, the hash function needs to combine the values for those components
to make a single hash-value. The result will be a sample consisting of all tuples with certain
key values. The selected key values will be approximately a / b of all the key values appear-
ing in the stream.
4.2.4
Varying the Sample Size
Often, the sample will grow as more of the stream enters the system. In our running ex-
ample, we retain all the search queries of the selected 1/10th of the users, forever. As time
goes on, more searches for the same users will be accumulated, and new users that are se-
lected for the sample will appear in the stream.
If we have a budget for how many tuples from the stream can be stored as the sample,
then the fraction of key values must vary, lowering as time goes on. In order to assure that
at all times, the sample consists of all tuples from a subset of the key values, we choose a
hash function h from key values to a very large number of values 0 , 1 , . . . , B − 1. We main-
tain a threshold t , which initially can be the largest bucket number, B − 1. At all times, the
sample consists of those tuples whose key K satisfies h ( K ) ≤ t . New tuples from the stream
are added to the sample if and only if they satisfy the same condition.
If the number of stored tuples of the sample exceeds the allotted space, we lower t to t
− 1 and remove from the sample all those tuples whose key K hashes to t . For efficiency,
we can lower t by more than 1, and remove the tuples with several of the highest hash val-
ues, whenever we need to throw some key values out of the sample. Further efficiency is
obtained by maintaining an index on the hash value, so we can find all those tuples whose
keys hash to a particular value quickly.
4.2.5
Exercises for Section 4.2
EXERCISE 4.2.1 Suppose we have a stream of tuples with the schema
Grades(university, courseID, studentID, grade)
Assume universities are unique, but a courseID is unique only within a university (i.e., dif-
ferent universities may have different courses with the same ID, e.g., “CS101”) and like-
wise, studentID's are unique only within a university (different universities may assign the
same ID to different students). Suppose we want to answer certain queries approximately
Search WWH ::




Custom Search