Approximating Streaming Data with Sketching - Real-Time Analytics

Database Reference

In-Depth Information

Choosing a Filter Size

The likelihood that the scenario mentioned at the end of the last section

occurs is clearly a function of three parameters: the number of elements

inserted into the set (n), the size of the bit set itself (m), and the number

of hash functions used to set bits (k). The algorithm's user selects the latter

two parameters, and the former can typically be estimated. This section

discusses the selection of these two parameters and their effect on the false

positive rate.

From before, to correctly identify an element as not being in the set, at least

one of the bits identified by the k hash functions must be zero. If the hash

functions are independent and uniformly distributed, the probability p that

a bit will still be zero after n elements have been inserted into the filter is

p=(1-1/m) kn . This equation is approximated by p=e -kn/m . The probability of

a false positive is approximately (1-p) k , the chance that all k registers being

checked will be set. This yields a final equation of (1-e -kn/m ) k .

The goal is to make p as small as possible, which happens when k = (m/

n) × ln 2. The number of hashes to use then depends on the size of the

filter and an idea of how many items will be entered into the Bloom Filter.

Substituting this into the equation above yields:

Solving for m yields the equation:

Setting n to 1, the number of bits required to store a single value is given

as approximately 2.08*ln(1/p) or 1.44*log2(1/p). From this approximation,

a false positive rate of 5 percent requires 6.22 bits of storage per element

to be entered into the set. Halving that rate only increases the storage per

element to 7.7 bits and a false positive rate of 1 percent only requires 9.6 bits

per storage.

Of course, register arrays must be allocated in a round number of bits. So,

rather than specify the error rate, it is also common to specify a number of

Search WWH ::

Custom Search

Home