Although this estimator works well when the cardinality is in an
“intermediate” range, there are further corrections to be performed when
the cardinality is very small or large.
When the raw estimate is smaller than 5m/2, there are likely to be registers
that have never been set. Letting V be the count of unused (zero-valued)
registers, if V is greater than 0 then N is better estimated by:

N = m × ln(m / V)
This is actually very similar to the result obtained by a different algorithm
called Linear Counting.
Similarly, when the cardinality gets very high, hash collisions begin to
cause an undercount of the number of distinct items. A correction for this
is applied when the raw estimate E exceeds 2^32/30 (cardinalities of
roughly 143 million):

N = −2^32 × ln(1 − E / 2^32)
Using these corrections, the relative error of the cardinality estimate is
approximately 1.04/√m.
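The two range corrections above can be sketched as a single function. This is an illustrative sketch, not the book's implementation; the function and parameter names (`corrected_estimate`, `raw_estimate`, `registers`) are my own, and a 32-bit hash space is assumed.

```python
import math

def corrected_estimate(raw_estimate, registers):
    """Apply HyperLogLog's small- and large-range corrections
    to a raw harmonic-mean estimate (sketch; assumes 32-bit hashes)."""
    m = len(registers)
    if raw_estimate <= 5 * m / 2:
        # Small-range correction: fall back to Linear Counting
        # while some registers are still unused.
        v = registers.count(0)
        if v > 0:
            return m * math.log(m / v)
        return raw_estimate
    if raw_estimate > (1 << 32) / 30:
        # Large-range correction for hash collisions
        # near the top of the 32-bit hash space.
        return -(1 << 32) * math.log(1 - raw_estimate / (1 << 32))
    return raw_estimate
```

In the intermediate range the raw estimate is returned unchanged; only the extremes are adjusted.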
Implementing HyperLogLog
The immediately obvious benefit of using HyperLogLog compared to
Min-Count is that the storage space required for the registers is much
smaller. Rather than having to store the smallest hash value, the algorithm
only needs to store the largest number of leading zeros. For a 32-bit hash
function, that number can only be as large as 32 and only requires 5 bits
to store rather than Min-Count's 32 bits for the same size hash function.
That works out to roughly 84 percent storage savings right off the bat. Of
course, it is usually easier to allocate a full 8 bits to each register,
which still saves 75 percent of the storage.
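The savings arithmetic is easy to check directly; this small snippet is illustrative only, with the variable names my own.

```python
# Register width comparison for a 32-bit hash function.
min_count_bits = 32   # Min-Count stores the full smallest hash value
hll_bits = 5          # a leading-zero count of at most 32 fits in 5 bits
savings = 1 - hll_bits / min_count_bits
print(f"{savings:.0%}")  # ~84 percent savings per register

byte_aligned_bits = 8    # simpler in practice: one byte per register
print(f"{1 - byte_aligned_bits / min_count_bits:.0%}")  # 75 percent
```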
The algorithm is well defined when between 2^4 and 2^16 registers are used,
so this is enforced when creating the object:
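A constructor enforcing that bound might look like the following. This is a minimal sketch rather than the book's code; the class name `HyperLogLog` and the parameter `b` (log2 of the register count) are assumptions.

```python
class HyperLogLog:
    def __init__(self, b):
        # b is log2 of the register count, so m = 2**b registers;
        # the estimator is only well defined for 2**4 <= m <= 2**16.
        if not 4 <= b <= 16:
            raise ValueError("register count must be between 2**4 and 2**16")
        self.b = b
        self.m = 1 << b
        # One byte per register for simplicity (75 percent savings
        # versus Min-Count's 32 bits per register).
        self.registers = bytearray(self.m)
```

Using a `bytearray` keeps every register byte-aligned, trading a few bits per register for much simpler indexing.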