Approximating Streaming Data with Sketching - Real-Time Analytics

Database Reference

In-Depth Information

} else if(r > M[j]) {

Z -= Math. pow (2, -M[j]);

Z += Math. pow (2, -r);

M[j] = (byte)r;

return true;

}

return false;

} catch(IOException ex) {

return false;

}

Finally, the implementation of the size() method uses the baseline value

Z (what Flajolet calls the indicator function) to produce a corrected estimate

of the cardinality:

public int size() {

double N = alpha/Z;

if(N <= 5.0*m/2.0 && V > 0) {

return (int) (m*Math. log ((double)m/(double)V));

} else if(N > Math. pow (2, 32)/30.0){

return (int) (-Math. pow (2, 32)

*Math. log (1.0 - N/Math. pow (2, 32)));

} else

return (int)N;

}

Improvements to the HyperLogLog Algorithm

The HyperLogLog algorithm was originally designed for scenarios where

the expected cardinality of the set to be measured is on the order of a

billion. With this approximate cardinality in mind, HyperLogLog's creators

introduced a correction to the raw estimate, N, in the implementation of the

size calculation that is used as the raw estimate gets large. This correction

comes into play to account for the increasing probability of hash collisions

and fixes estimates for cardinalities in excess of 140 million or so.

Formostusers,thisrangeofcardinalityismorethansufficient.Applications

operating on the scale of a single website will see cardinalities on the order

of millions. Applications that deal with multiple websites, such as online

Search WWH ::

Custom Search

Home