Databases Reference
In-Depth Information
four positions, but only two of them are 1. Proceeding left, we see two buckets
of size 4, and we suggest that a bucket of size 8 exists further left.
Notice that it is OK for some 0's to lie between buckets. Also, observe from
Fig. 4.2 that the buckets do not overlap; there are one or two of each size up to
the largest size, and sizes only increase moving left.
2
In the next sections, we shall explain the following about the DGIM algo-
rithm:
1. Why the number of buckets representing a window must be small.
2. How to estimate the number of 1's in the last k bits for any k, with an
error no greater than 50%.
3. How to maintain the DGIM conditions as new bits enter the stream.
4.6.3 Storage Requirements for the DGIM Algorithm
We observed that each bucket can be represented by O(log n) bits. If the
window has length N , then there are no more than N 1's, surely. Suppose the
largest bucket is of size 2 j . Then j cannot exceed log 2 N , or else there are more
1's in this bucket than there are 1's in the entire window. Thus, there are at
most two buckets of all sizes from log 2 N down to 1, and no buckets of larger
sizes.
We conclude that there are O(log N ) buckets. Since each bucket can be
represented in O(log N ) bits, the total space required for all the buckets repre-
senting a window of size N is O(log 2 N ).
4.6.4 Query Answering in the DGIM Algorithm
Suppose we are asked how many 1's there are in the last k bits of the window,
for some 1≤k≤N . Find the bucket b with the highest timestamp that
includes at least some of the k most recent bits. Estimate the number of 1's to
be the sum of the sizes of all the buckets to the right (more recent) than bucket
b, plus half the size of b itself.
Example 4.12 : Suppose the stream is that of Fig. 4.2, and k = 10. Then the
query asks for the number of 1's in the ten rightmost bits, which happen to be
0110010110 . Let the current timestamp (time of the rightmost bit) be t. Then
the two buckets with one 1, having timestamps t−1 and t−2 are completely
included in the answer. The bucket of size 2, with timestamp t−4, is also
completely included. However, the rightmost bucket of size 4, with timestamp
t−8 is only partly included. We know it is the last bucket to contribute to the
answer, because the next bucket to its left has timestamp less than t−9 and
thus is completely out of the window. On the other hand, we know the buckets
to its right are completely inside the range of the query because of the existence
of a bucket to their left with timestamp t−9 or greater.
Search WWH ::




Custom Search