Advanced MapReduce - Hadoop in Action

Databases Reference

In-Depth Information

The major benefit of a Bloom filter is that its size, in number of bits, is constant and is

set upon initialization. Adding more elements to a Bloom filter doesn't increase its size.

It only increases the false positive rate. A Bloom filter also has another configuration

parameter to denote the number of hash functions it uses. We'll discuss the reason for

this parameter and how the hash functions are used later when we discuss the Bloom

filter's implementation. For now, its main implication is that it affects the false positive

rate. The false positive rate is approximated by the equation

(1 - exp(-kn/m))k

where k is the number of hash functions used, m is the number of bits used to store the

Bloom filter, and n is the number of elements to be added to the Bloom filter. In practice,

m and n are determined by the requirement of the system, and therefore, k is chosen to

minimize the false positive rate given m and n , which (after a little calculus) is

k = ln(2) * (m/n)

≈

0.7 * (m/n)

The false positive rate with the given k is 0.6185 m/n , and k has to be an integer. The

false positive rate will only be an approximation. From a design point of view, one

should think in terms of ( m/n ), number of bits per element, rather than m alone. For

example, we have to store a set containing ten million URLs ( n =10,000,000). Allocat-

ing 8 bits per URL ( m / n =8) will require a 10 MB Bloom filter ( m = 80,000,000 bits).

This Bloom filter will have a false positive rate of (0.6185)8, or about 2 percent. If we

were to implement the Set class by storing the raw URLs, and let's say the average URL

length was 100 bytes, we would have to use 1 GB. Bloom filter has shrunk the storage

requirement by 2 orders of magnitude at the expense of only a 2 percent false positive

rate! A slight increase in storage allocated to the Bloom filter will reduce the false posi-

tive rate further. At 10 bits per URL, the Bloom filter will take up 12.5 MB and have a

false positive rate of only 0.8 percent.

In summary, the signature of our Bloom filter class will look like the following:

class BloomFilter<E> {

public BloomFilter(int m, int k) { ... }

public void add(E obj) { ... }

public boolean contains(E obj) { ... }

}

More applications of the Bloom filter

The Bloom filter found its first successful applications back when memory was scarce.

One of its first uses was in spellchecking. Not being able to store a whole dictionary

in memory, spellcheckers used a Bloom filter representation (of the dictionary) to

catch most misspellings. As memory size grew and became cheaper, such space

consideration waned. Bloom filter usage is finding a resurgence in large-scale data-

intensive operations as data is fast outgrowing memory and bandwidth.

We've already seen commercial products, such as Oracle 11g, using Bloom filters

to join data across distributed databases. In the networking world, one successful

Search WWH ::

Custom Search

Home