Database Reference
In-Depth Information
Choosing a Filter Size
The likelihood that the scenario mentioned at the end of the last section
occurs is clearly a function of three parameters: the number of elements
inserted into the set (n), the size of the bit set itself (m), and the number
of hash functions used to set bits (k). The algorithm's user selects the latter
two parameters, and the former can typically be estimated. This section
discusses the selection of these two parameters and their effect on the false
positive rate.
From before, to correctly identify an element as not being in the set, at least
one of the bits identified by the k hash functions must be zero. If the hash
functions are independent and uniformly distributed, the probability p that
a bit will still be zero after n elements have been inserted into the filter is
p=(1-1/m) kn . This equation is approximated by p=e -kn/m . The probability of
a false positive is approximately (1-p) k , the chance that all k registers being
checked will be set. This yields a final equation of (1-e -kn/m ) k .
The goal is to make p as small as possible, which happens when k = (m/
n) × ln 2. The number of hashes to use then depends on the size of the
filter and an idea of how many items will be entered into the Bloom Filter.
Substituting this into the equation above yields:
Solving for m yields the equation:
Setting n to 1, the number of bits required to store a single value is given
as approximately 2.08*ln(1/p) or 1.44*log2(1/p). From this approximation,
a false positive rate of 5 percent requires 6.22 bits of storage per element
to be entered into the set. Halving that rate only increases the storage per
element to 7.7 bits and a false positive rate of 1 percent only requires 9.6 bits
per storage.
Of course, register arrays must be allocated in a round number of bits. So,
rather than specify the error rate, it is also common to specify a number of
Search WWH ::




Custom Search