Statistical Approximation of Streaming Data - Real-Time Analytics

Database Reference

In-Depth Information

reservoir[N] = (Object)arg0;

} else {

long k = ( long )Math. floor (N*rng.nextDouble());

if (k < ( long )reservoir.length) reservoir[( int )k]

= (Object)arg0;

}

N++;

return true ;

}

This is effectively the same thing that the Durstenfeld version of the

Fisher-Yates algorithm did when it moved backward throughout the array.

The proof is not given here, but this ensures that each element in the stream

has an equal probability of being in the sample at every point in the analysis.

Biased Streaming Sampling

The basic reservoir algorithm, also known as Algorithm R (the name given

in Knuth's “The Art of Computer Programming”), ensures a uniform sample

from the beginning of time until the time the stream stops. This is largely

due to its original application, which was to sample from a “stream” in the

sense of only being allowed a single pass over an otherwise finite dataset.

This makes the algorithm well suited to providing samples in a batch setting

where Map-Reduce is being used to process a finite dataset.

It is less interesting in the streaming sense used by this topic, which is

more interested in the dynamics of the sample over time. In fact, as the

sample ages, it is less and less likely that a recent data point will actually

be included in the sample (the probability that a random number between

[0,N] is smaller than n gets smaller as N increases). Instead, it would be

preferable if the algorithm could be biased toward including newer events in

preference to older events.

Sliding Window Reservoir Sampling

A simple method for introducing time dynamics into reservoir sampling is

to maintain a sliding window of samples. In this implementation, a number

of reservoir “buckets” are maintained in a linked list:

public class SlidingWindow<E> {

int n,k,max;

Search WWH ::

Custom Search

Home