Databases Reference
In-Depth Information
The major benefit of a Bloom filter is that its size, in number of bits, is constant and is
set upon initialization. Adding more elements to a Bloom filter doesn't increase its size.
It only increases the false positive rate. A Bloom filter also has another configuration
parameter to denote the number of hash functions it uses. We'll discuss the reason for
this parameter and how the hash functions are used later when we discuss the Bloom
filter's implementation. For now, its main implication is that it affects the false positive
rate. The false positive rate is approximated by the equation
(1 - exp(-kn/m))k
where k is the number of hash functions used, m is the number of bits used to store the
Bloom filter, and n is the number of elements to be added to the Bloom filter. In practice,
m and n are determined by the requirement of the system, and therefore, k is chosen to
minimize the false positive rate given m and n , which (after a little calculus) is
k = ln(2) * (m/n)
0.7 * (m/n)
The false positive rate with the given k is 0.6185 m/n , and k has to be an integer. The
false positive rate will only be an approximation. From a design point of view, one
should think in terms of ( m/n ), number of bits per element, rather than m alone. For
example, we have to store a set containing ten million URLs ( n =10,000,000). Allocat-
ing 8 bits per URL ( m / n =8) will require a 10 MB Bloom filter ( m = 80,000,000 bits).
This Bloom filter will have a false positive rate of (0.6185)8, or about 2 percent. If we
were to implement the Set class by storing the raw URLs, and let's say the average URL
length was 100 bytes, we would have to use 1 GB. Bloom filter has shrunk the storage
requirement by 2 orders of magnitude at the expense of only a 2 percent false positive
rate! A slight increase in storage allocated to the Bloom filter will reduce the false posi-
tive rate further. At 10 bits per URL, the Bloom filter will take up 12.5 MB and have a
false positive rate of only 0.8 percent.
In summary, the signature of our Bloom filter class will look like the following:
class BloomFilter<E> {
public BloomFilter(int m, int k) { ... }
public void add(E obj) { ... }
public boolean contains(E obj) { ... }
}
More applications of the Bloom filter
The Bloom filter found its first successful applications back when memory was scarce.
One of its first uses was in spellchecking. Not being able to store a whole dictionary
in memory, spellcheckers used a Bloom filter representation (of the dictionary) to
catch most misspellings. As memory size grew and became cheaper, such space
consideration waned. Bloom filter usage is finding a resurgence in large-scale data-
intensive operations as data is fast outgrowing memory and bandwidth.
We've already seen commercial products, such as Oracle 11g, using Bloom filters
to join data across distributed databases. In the networking world, one successful
 
 
Search WWH ::




Custom Search