Approximating Streaming Data with Sketching - Real-Time Analytics

Database Reference

In-Depth Information

Cardinality Estimation

Although there are more efficient methods of maintaining an estimate of a

set's cardinality, the Bloom Filter can also approximate cardinality.

The basic trick is to go back to the probability that a particular bit is set to

1. When choosing an optimal filter size and number of hash functions, the

probability that a bit was set was p = 1-e -kn/m .

Assuming that the probability of each bit being set to 1 is independent,

the register array can be considered m independent, identically distributed

Bernoulli trials. This is the same as flipping a coin m times and counting

how many coins come up heads.

After inserting n elements, the probability of a 1 can be estimated by the

number of 1 bits x divided by the total number of bits m. Plugging x/m into

the equation and solving for n yields:

A slightly better estimate can be obtained by not using the approximation

from the earlier analysis. This yields a somewhat more complicated

expression for n:

However, practically speaking, the difference between -m/k and 1/k ×

log(1-1/m) is going to be quite small.

The naïve implementation of the size() method of the Bloom Filter Set

simply uses BitSet 's native cardinality() method:

public int size() {

int X = bits.cardinality();

return (int) (Math. log (1 - (double)X/

(double)bits.length())/

(Math. log (1.0 - 1.0/

(double)bits.length())*(double)hashes.length));

}

Search WWH ::

Custom Search

Home