Mining Data Streams - Mining of Massive Datasets

Database Reference

In-Depth Information

There is a pattern to the distribution of bucket sizes as we execute the basic algorithm of Section 4.6.5 . Think of two

buckets of size 2 j as a ”1” in position j and one bucket of size 2 j as a ”0” in that position. Then as 1s arrive in the

stream, the bucket sizes after each 1 form consecutive binary integers. The occasional long sequences of bucket com-

binations are analogous to the occasional long rippling of carries as we go from an integer like 101111 to 110000.

No matter what j is, this fraction is upper bounded by 1/( r −1). Thus, by picking r suffi-

ciently large, we can limit the error to any desired ϵ > 0.

4.6.7

Extensions to the Counting of Ones

It is natural to ask whether we can extend the technique of this section to handle aggrega-

tions more general than counting 1s in a binary stream. An obvious direction to look is to

consider streams of integers and ask if we can estimate the sum of the last k integers for

any 1 ≤ k ≤ N , where N , as usual, is the window size.

It is unlikely that we can use the DGIM approach to streams containing both positive

and negative integers. We could have a stream containing both very large positive integers

and very large negative integers, but with a sum in the window that is very close to 0. Any

imprecision in estimating the values of these large integers would have a huge effect on the

estimate of the sum, and so the fractional error could be unbounded.

For example, suppose we broke the stream into buckets as we have done, but represented

the bucket by the sum of the integers therein, rather than the count of 1s. If b is the bucket

that is partially within the query range, it could be that b has, in its first half, very large

negative integers and in its second half, equally large positive integers, with a sum of 0. If

we estimate the contribution of b by half its sum, that contribution is essentially 0. But the

actual contribution of that part of bucket b that is in the query range could be anything from

0 to the sum of all the positive integers. This difference could be far greater than the actual

query answer, and so the estimate would be meaningless.

On the other hand, some other extensions involving integers do work. Suppose that the

stream consists of only positive integers in the range 1 to 2 m for some m . We can treat each

of the m bits of each integer as if it were a separate stream.

We then use the DGIM method to count the 1s in each bit. Suppose the count of the i th

bit (assuming bits count from the low-order end, starting at 0) is c i . Then the sum of the

integers is

Search WWH ::

Custom Search

Home