Although this estimator works well when the cardinality is in an
“intermediate” range, there are further corrections to be performed when
the cardinality is very small or large.
When the raw estimate is smaller than 5m/2, there are likely to be registers
that have never been set. Letting V be the count of unused (zero-valued)
registers, if V is greater than 0 then N is better estimated by:

N = m × ln(m / V)
This is actually very similar to the result obtained by a different algorithm
called Linear Counting.
Similarly, when the cardinality gets very high, hash collisions begin to
cause an undercount of the number of distinct items. A correction for this
is applied when the raw estimate E exceeds 2^32/30 (cardinalities of
roughly 143 million):

N = −2^32 × ln(1 − E / 2^32)
Using these corrections, the relative error of the cardinality estimate is
approximately 1.04/√m.
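The two range corrections above can be sketched as a single function. This is an illustrative sketch, not the book's implementation; the function and parameter names (`corrected_estimate`, `raw_estimate`, `registers`) are my own, and a 32-bit hash space is assumed.

```python
import math

def corrected_estimate(raw_estimate, registers):
    """Apply HyperLogLog's small- and large-range corrections
    to a raw harmonic-mean estimate (sketch; assumes 32-bit hashes)."""
    m = len(registers)
    if raw_estimate <= 5 * m / 2:
        # Small-range correction: fall back to Linear Counting
        # while some registers are still unused.
        v = registers.count(0)
        if v > 0:
            return m * math.log(m / v)
        return raw_estimate
    if raw_estimate > (1 << 32) / 30:
        # Large-range correction for hash collisions
        # near the top of the 32-bit hash space.
        return -(1 << 32) * math.log(1 - raw_estimate / (1 << 32))
    return raw_estimate
```

In the intermediate range the raw estimate is returned unchanged; only the extremes are adjusted.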
Implementing HyperLogLog
The immediately obvious benefit of using HyperLogLog compared to
Min-Count is that the storage space required for the registers is much
smaller. Rather than having to store the smallest hash value, the algorithm
only needs to store the largest number of leading zeros. For a 32-bit hash
function, that number can only be as large as 32 and only requires 5 bits
to store rather than Min-Count's 32 bits for the same size hash function.
That works out to roughly 84 percent storage savings right off the bat. Of
course, it is usually easier to allocate a full 8 bits to each register,
which still saves 75 percent of the storage.
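The savings arithmetic is easy to check directly; this small snippet is illustrative only, with the variable names my own.

```python
# Register width comparison for a 32-bit hash function.
min_count_bits = 32   # Min-Count stores the full smallest hash value
hll_bits = 5          # a leading-zero count of at most 32 fits in 5 bits
savings = 1 - hll_bits / min_count_bits
print(f"{savings:.0%}")  # ~84 percent savings per register

byte_aligned_bits = 8    # simpler in practice: one byte per register
print(f"{1 - byte_aligned_bits / min_count_bits:.0%}")  # 75 percent
```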
The algorithm is well defined when between 2^4 and 2^16 registers are used,
so this is enforced when creating the object:
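A constructor enforcing that bound might look like the following. This is a minimal sketch rather than the book's code; the class name `HyperLogLog` and the parameter `b` (log2 of the register count) are assumptions.

```python
class HyperLogLog:
    def __init__(self, b):
        # b is log2 of the register count, so m = 2**b registers;
        # the estimator is only well defined for 2**4 <= m <= 2**16.
        if not 4 <= b <= 16:
            raise ValueError("register count must be between 2**4 and 2**16")
        self.b = b
        self.m = 1 << b
        # One byte per register for simplicity (75 percent savings
        # versus Min-Count's 32 bits per register).
        self.registers = bytearray(self.m)
```

Using a `bytearray` keeps every register byte-aligned, trading a few bits per register for much simpler indexing.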