Databases Reference
In-Depth Information
result will be a sample consisting of all tuples with certain key values. The
selected key values will be approximately a/b of all the key values appearing in
the stream.
4.2.4
Varying the Sample Size
Often, the sample will grow as more of the stream enters the system. In our
running example, we retain all the search queries of the selected 1/10th of
the users, forever. As time goes on, more searches for the same users will be
accumulated, and new users that are selected for the sample will appear in the
stream.
If we have a budget for how many tuples from the stream can be stored as
the sample, then the fraction of key values must vary, lowering as time goes
on. In order to assure that at all times, the sample consists of all tuples from
a subset of the key values, we choose a hash function h from key values to a
very large number of values 0, 1, . . . , B−1. We maintain a threshold t, which
initially can be the largest bucket number, B−1. At all times, the sample
consists of those tuples whose key K satisfies h(K)≤t. New tuples from the
stream are added to the sample if and only if they satisfy the same condition.
If the number of stored tuples of the sample exceeds the allotted space, we
lower t to t−1 and remove from the sample all those tuples whose key K hashes
to t. For e ciency, we can lower t by more than 1, and remove the tuples with
several of the highest hash values, whenever we need to throw some key values
out of the sample. Further e ciency is obtained by maintaining an index on
the hash value, so we can find all those tuples whose keys hash to a particular
value quickly.
4.2.5
Exercises for Section 4.2
Exercise 4.2.1 : Suppose we have a stream of tuples with the schema
Grades(university, courseID, studentID, grade)
Assume universities are unique, but a courseID is unique only within a uni-
versity (i.e., different universities may have different courses with the same ID,
e.g., “CS101”) and likewise, studentID's are unique only within a university
(different universities may assign the same ID to different students). Suppose
we want to answer certain queries approximately from a 1/20th sample of the
data. For each of the queries below, indicate how you would construct the
sample. That is, tell what the key attributes should be.
(a) For each university, estimate the average number of students in a course.
(b) Estimate the fraction of students who have a GPA of 3.5 or more.
(c) Estimate the fraction of courses where at least half the students got “A.”
Search WWH ::




Custom Search