Mining Data Streams - Mining of Massive Datasets

Database Reference

In-Depth Information

cess may ripple through the bucket sizes, but there are at most log 2 N different sizes, and

the combination of two adjacent buckets of the same size only requires constant time. As a

result, any new bit can be processed in O (log N ) time.

EXAMPLE 4.13 Suppose we start with the buckets of Fig. 4.2 and a 1 enters. First, the left-

most bucket evidently has not fallen out of the window, so we do not drop any buckets. We

create a new bucket of size 1 with the current timestamp, say t . There are now three buckets

of size 1, so we combine the leftmost two. They are replaced with a single bucket of size

2. Its timestamp is t − 2, the timestamp of the bucket on the right (i.e., the rightmost bucket

that actually appears in Fig. 4.2 .

There are now two buckets of size 2, but that is allowed by the DGIM rules. Thus, the

final sequence of buckets after the addition of the 1 is as shown in Fig. 4.3 .

□

Figure 4.3 Modified buckets after a new 1 arrives in the stream

4.6.6

Reducing the Error

Instead of allowing either one or two of each size bucket, suppose we allow either r − 1 or

r of each of the exponentially growing sizes 1 , 2 , 4 , . . . , for some integer r > 2. In order to

represent any possible number of 1s, we must relax this condition for the buckets of size 1

and buckets of the largest size present; there may be any number, from 1 to r , of buckets of

these sizes.

The rule for combining buckets is essentially the same as in Section 4.6.5 . If we get r +

1 buckets of size 2 j , combine the leftmost two into a bucket of size 2 j +1 . That may, in turn,

cause there to be r + 1 buckets of size 2 j +1 , and if so we continue combining buckets of

larger sizes.

The argument used in Section 4.6.4 can also be used here. However, because there are

more buckets of smaller sizes, we can get a stronger bound on the error. We saw there that

the largest relative error occurs when only one 1 from the leftmost bucket b is within the

query range, and we therefore overestimate the true count. Suppose bucket b is of size 2 j .

Then the true count is at least 1 + ( r − 1)(2 j− 1 + 2 j− 2 + · · · + 1) = 1 + ( r − 1)(2 j − 1). The

overestimate is 2 j− 1 − 1. Thus, the fractional error is

Bucket Sizes and Ripple-Carry Adders

Search WWH ::

Custom Search

Home