Approximating Streaming Data with Sketching - Real-Time Analytics

Database Reference

In-Depth Information

collision of one hash function might be too high, the probability that several

hashes will collide is small enough to be useful.

Clearly, this approach tends to overestimate the approximate frequency of

a value under the Cash Register model. Another element could (and will,

when enough elements are entered) increment the same counter, much like

the Bloom Filter. So, how badly does the Count-Min sketch overestimate

this count?

In the original Count-Min paper, the authors show that the probability that

the estimate of an element is between its true value x and an upper bound

x + em, where m is the number of elements entered into a map, is larger

than 1-d under two conditions. The first condition is that the width of the

sketch is 2/e. The second condition is that the depth of the sketch is (log

1/d)/log 2. So, a Count-Min sketch where the estimate is within 5 percent

of the sum with a 99 percent probability would have a width of 40 and a

depth of 7. A depth of 8 with a width of 128 would have a relative error of

approximately 1.5 percent with a probability of approximately 99.6 percent.

Using 32-bit counters, the Count-Min sketch requires the same amount of

space as a HyperLogLog sketch with b=12.

Count-Min Sketch Implementation

Ratherthanimplementing a Set<E> ,theCount-Min sketchimplements the

Java class Map<E,Long> albeit with some restrictions.

public class CountMinSketch<E extends Serializable>

implements Map<E,Long> {

int width = 0;

int depth = 0;

byte[] m;

SerializableHasher[] hashes;

ObjectOutputStream[] outputs;

protected void initialize(int m,int[] seeds) throws

IOException {

hashes = new SerializableHasher[seeds.length];

outputs= new ObjectOutputStream[seeds.length];

for(int i=0;i<seeds.length;i++) {

hashes[i] = (new

SerializableHasher()).seed(seeds[i]);

Search WWH ::

Custom Search

Home