Database Reference
In-Depth Information
Sampling of Streams : To create a sample of a stream that is usable for a class of queries, we identify a set of key
attributes for the stream. By hashing the key of any arriving stream element, we can use the hash value to decide
consistently whether all or none of the elements with that key will become part of the sample.
Bloom Filters : This technique allows us to filter streams so elements that belong to a particular set are allowed
through, while most nonmembers are deleted. We use a large bit array, and several hash functions. Members of the
selected set are hashed to buckets, which are bits in the array, and those bits are set to 1. To test a stream element for
membership, we hash the element to a set of bits using each of the hash functions, and only accept the element if all
these bits are 1.
Counting Distinct Elements : To estimate the number of different elements appearing in a stream, we can hash ele-
ments to integers, interpreted as binary numbers. 2 raised to the power that is the longest sequence of 0s seen in the
hash value of any stream element is an estimate of the number of different elements. By using many hash functions
and combining these estimates, first by taking averages within groups, and then taking the median of the averages,
we get a reliable estimate.
Moments of Streams : The k th moment of a stream is the sum of the k th powers of the counts of each element that
appears at least once in the stream. The 0th moment is the number of distinct elements, and the 1st moment is the
length of the stream.
Estimating Second Moments : A good estimate for the second moment, or surprise number, is obtained by choosing
a random position in the stream, taking twice the number of times this element appears in the stream from that po-
sition onward, subtracting 1, and multiplying by the length of the stream. Many random variables of this type can
be combined like the estimates for counting the number of distinct elements, to produce a reliable estimate of the
second moment.
Estimating Higher Moments : The technique for second moments works for k th moments as well, as long as we re-
place the formula 2 x − 1 (where x is the number of times the element appears at or after the selected position) by x k
− ( x − 1) k .
Estimating the Number of 1s in a Window : We can estimate the number of 1s in a window of 0s and 1s by grouping
the 1s into buckets. Each bucket has a number of 1s that is a power of 2; there are one or two buckets of each size,
and sizes never decrease as we go back in time. If we record only the position and size of the buckets, we can rep-
resent the contents of a window of size N with O (log 2 N ) space.
Answering Queries About Numbers of 1s : If we want to know the approximate numbers of 1s in the most recent k
elements of a binary stream, we find the earliest bucket B that is at least partially within the last k positions of the
window and estimate the number of 1s to be the sum of the sizes of each of the more recent buckets plus half the
size of B . This estimate can never be off by more that 50% of the true count of 1s.
Closer Approximations to the Number of 1s : By changing the rule for how many buckets of a given size can exist
in the representation of a binary window, so that either r or r − 1 of a given size may exist, we can assure that the
approximation to the true number of 1s is never off by more than 1/ r .
Exponentially Decaying Windows : Rather than fixing a window size, we can imagine that the window consists of
all the elements that ever arrived in the stream, but with the element that arrived t time units ago weighted by e −ct
for some time-constant c . Doing so allows us to maintain certain summaries of an exponentially decaying window
easily. For instance, the weighted sum of elements can be recomputed, when a new element arrives, by multiplying
the old sum by 1 − c and then adding the new element.
Maintaining Frequent Elements in an Exponentially Decaying Window : We can imagine that each item is represen-
ted by a binary stream, where 0 means the item was not the element arriving at a given time, and 1 means that it was.
We can find the elements whose sum of their binary stream is at least 1/2. When a new element arrives, multiply
all recorded sums by 1 minus the time constant, add 1 to the count of the item that just arrived, and delete from the
record any item whose sum has fallen below 1/2.
Search WWH ::




Custom Search