Database Reference
In-Depth Information
advertising, can see cardinalities on the order of 100 million and perhaps 1
billion at the extremes.
Google,ontheotherhand,routinelydealswithcardinalitieswellinexcessof
1billion.Todealwiththis,Googlehasintroducedanumberofmodifications
to HyperLogLog it calls HyperLogLog++ and presented in a paper in early
2013 (the paper also mentions that Google had previously been a heavy user
of the Min-Count algorithm).
To use HyperLogLog with very large cardinalities, Google swaps the 32-bit
hash function for a 64-bit hash function. It uses a proprietary hash function
known only to Google, but there are 64-bit variants of the MurmurHash
function used in this implementation.
This change has two effects. First, more space is required to minimally store
the registers since 6 bits will be required instead of the original 5. The
implementation presented in the previous section is sub-optimal and uses 8
bits to represent the counter so the changes have no effect.
Second, the probability of hash collisions is much smaller than the original
32-bit hash function so the large range correction is no longer needed for
practical cardinalities. If the cardinalities were to approach these levels, the
authors suggest that better results are likely to be had by simply making the
hash function larger. Doubling the size of the hash function only increases
the storage space by 1 bit. The implementation given could use 256-bit hash
functions without requiring any extra storage.
This still leaves the small cardinality corrections, which actually use an
entirely different algorithm! To improve the bias when the cardinality is
small, Google conducted an extensive empirical investigation for known
cardinalities. This resulted in a set of interpolation points that can be used
when cardinalities are small. The size() method then becomes:
public int size () {
double N = alpha/Z;
if(N < 5.0*m) N -= estimateBias(N);
double H = (V > 0) ?
-Math. pow (2, 32)*Math. log (1.0 - N/Math. pow (2, 32)) : N;
return (int)(H < threshold(b) ? H : N);
}
Search WWH ::




Custom Search