Databases Reference
In-Depth Information
4.8
Summary of Chapter 4
F
The Stream Data Model : This model assumes data arrives at a processing
engine at a rate that makes it infeasible to store everything in active
storage. One strategy to dealing with streams is to maintain summaries
of the streams, su cient to answer the expected queries about the data.
A second approach is to maintain a sliding window of the most recently
arrived data.
F
Sampling of Streams: To create a sample of a stream that is usable for
a class of queries, we identify a set of key attributes for the stream. By
hashing the key of any arriving stream element, we can use the hash value
to decide consistently whether all or none of the elements with that key
will become part of the sample.
F
Bloom Filters: This technique allows us to filter streams so elements that
belong to a particular set are allowed through, while most nonmembers
are deleted. We use a large bit array, and several hash functions. Members
of the selected set are hashed to buckets, which are bits in the array, and
those bits are set to 1. To test a stream element for membership, we hash
the element to a set of bits using each of the hash functions, and only
accept the element if all these bits are 1.
F
Counting Distinct Elements: To estimate the number of different elements
appearing in a stream, we can hash elements to integers, interpreted as
binary numbers. 2 raised to the power that is the longest sequence of 0's
seen in the hash value of any stream element is an estimate of the number
of different elements. By using many hash functions and combining these
estimates, first by taking averages within groups, and then taking the
median of the averages, we get a reliable estimate.
F
Moments of Streams: The kth moment of a stream is the sum of the kth
powers of the counts of each element that appears at least once in the
stream. The 0th moment is the number of distinct elements, and the 1st
moment is the length of the stream.
F
Estimating Second Moments: A good estimate for the second moment, or
surprise number, is obtained by choosing a random position in the stream,
taking twice the number of times this element appears in the stream from
that position onward, subtracting 1, and multiplying by the length of
the stream. Many random variables of this type can be combined like
the estimates for counting the number of distinct elements, to produce a
reliable estimate of the second moment.
F
Estimating Higher Moments: The technique for second moments works
for kth moments as well, as long as we replace the formula 2x−1 (where
x is the number of times the element appears at or after the selected
position) by x k −(x−1) k .
Search WWH ::




Custom Search