Approximating Streaming Data with Sketching - Real-Time Analytics

Database Reference

In-Depth Information

To improve the error, one approach would be to use the same approach

employed by the Bloom Filter, introducing m different hash functions to

produce m independent streams of data. Taking the arithmetic mean of

these m independentstreamsyieldsthesameexpectationasasinglestream,

but with a standard deviation proportional to thereby improving the

standard error. Using m independent hash functions is computationally

expensive, so Min-Count uses a technique called stochastic averaging ,

which was introduced by Flajolet in 1985 to simulate independent streams.

To simulate independent random streams, the first b bits of the hash value

are used to define 2 b registers where b is typically smaller than half the

number of bits used in the hash function. If each register that stores the

three smallest is normalized to be a number in [0,1] the cardinality estimate

becomes:

with a standard error of . To use a concrete example, a 32-bit hash function

with 65,536 registers would require 384k of storage. By contrast, a sampling

approach using the same 32-bit hash to reduce the storage requirements

would be able to store about 98,000 values in the same amount of space,

which may not be sufficient if some of the distinct values are quite rare and

the data stream very large.

Implementing Min-Count

Like the Bloom Filter implementation, the Min-Count implementation uses

the standard Java Set interface and works on serializable elements. This

implementation also uses the third smallest value to avoid the issues

mentioned in the previous section. Using a parameter b to define the

number of bits to use and defining 2b registers, the class is initialized as

follows:

public class MinCount<E extends Serializable>

implements Set<E> {

int b;

int[] M;

SerializableHasher hash = new SerializableHasher();

ObjectOutputStream out;

Search WWH ::

Custom Search

Home