Advanced MapReduce - Hadoop in Action

whose key is not in the set CustomerID415. This is sometimes called a semijoin, taking

the terminology from the database world.

Last but not least, what if the file CustomerID415 is still too big to fit in memory? Or

maybe CustomerID415 does fit in memory but its size makes replicating it across all

the mappers inefficient. This situation calls for a data structure called a Bloom filter. A

Bloom filter is a compact representation of a set that supports only the contain query

(“Does this set contain this element?”). Furthermore, the query answer is not completely

accurate, but it's guaranteed to have no false negatives and a small probability of false

positives. The slight inaccuracy is the trade-off for the data structure's compactness. By

using a Bloom filter representation of CustomerID415, the mappers will pass through

all customers in the 415 area code. Because the filter has no false negatives, the data join

algorithm remains correct. The Bloom filter will also pass a small portion of customers not in the 415

area code to the reduce phase. This is fine because those will be ignored in the reduce

phase. We'll still have improved performance by dramatically reducing the amount

of traffic shuffled across the network. The use of Bloom filters is in fact a standard

technique for joining in distributed databases, and it's used in commercial products

such as Oracle 11g. We'll describe the Bloom filter and its other applications in more

detail in the next section.
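The filtering flow described above can be sketched in plain Java, without any Hadoop dependency. This is a minimal illustration under stated assumptions, not the book's implementation: the class name SemijoinFilterSketch, the constant NUM_BITS, and all method names are hypothetical, and the summary is a deliberately crude single-hash bit array standing in for a full Bloom filter. The reduce-side check against the exact set is also a simplification of what the join does when it finds no matching customer record.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Set;

public class SemijoinFilterSketch {
    // Hypothetical size of the bit array; a real job would tune this.
    static final int NUM_BITS = 1024;

    // Maps a customer ID to one bit position (a single hash function).
    static int slot(String customerId) {
        return Math.floorMod(customerId.hashCode(), NUM_BITS);
    }

    // Builds the compact summary of CustomerID415 that is shipped to every mapper.
    static BitSet summarize(Set<String> customerId415) {
        BitSet bits = new BitSet(NUM_BITS);
        for (String id : customerId415) {
            bits.set(slot(id));
        }
        return bits;
    }

    // Map side: keep a key only if the summary says it may be in the set.
    // Every real member passes (no false negatives); a few non-members may too.
    static List<String> mapSideFilter(List<String> keys, BitSet summary) {
        List<String> passed = new ArrayList<>();
        for (String key : keys) {
            if (summary.get(slot(key))) {
                passed.add(key);
            }
        }
        return passed;
    }

    // Reduce side: an exact check discards the false positives that slipped through.
    static List<String> reduceSideJoin(List<String> shuffledKeys, Set<String> customerId415) {
        List<String> joined = new ArrayList<>();
        for (String key : shuffledKeys) {
            if (customerId415.contains(key)) {
                joined.add(key);
            }
        }
        return joined;
    }
}
```

Only the keys returned by mapSideFilter() would be shuffled across the network, so the saving comes from how few non-members the summary lets through.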

5.3 Creating a Bloom filter

If you use Hadoop for batch processing of large data sets, your data-intensive computing

needs probably include transaction-style processing as well. We won't cover all

the techniques for running real-time distributed data processing (caching, sharding,

etc.). They aren't necessarily Hadoop-related and are well beyond the scope of this

book. One lesser-known tool for real-time data processing is the Bloom filter, which

is a summary of a data set whose usage makes other data processing techniques more

efficient. When that data set is big, Hadoop is often called in to generate the Bloom

filter representation. As we mentioned earlier, a Bloom filter is also sometimes used for

data joining within Hadoop itself. As a data processing expert, you'll be well rewarded

to have the Bloom filter in your bag of tricks. In this section we'll explain this data

structure in more detail and we'll go through an online ad network example that will

build a Bloom filter using Hadoop.

5.3.1 What does a Bloom filter do?

At its most basic, a Bloom filter object supports two methods: add() and contains().

These two methods work much like their counterparts in the Java Set interface. The method add()

adds an object to the set, and the method contains() returns a Boolean true/false

value denoting whether an object is in the set or not. But, for a Bloom filter, contains()

doesn't always give an accurate answer. It has no false negatives. If contains() returns

false, you can be sure that the set doesn't have the object queried. It does have a small

probability of false positives, though: contains() can return true for some objects not in

the set. The probability of false positives depends on the number of elements in the set

and some configuration parameters of the Bloom filter itself.
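To make the add() and contains() contract concrete, here is a minimal sketch of such an object. It is an illustration under assumptions, not the implementation this book builds: the class name SimpleBloomFilter, the parameters numBits and numHashes, and the way the hash positions are derived from hashCode() are all hypothetical choices. The method estimatedFalsePositiveRate() uses the standard approximation (1 - e^(-kn/m))^k for m bits, k hash functions, and n added elements, which shows how the error rate depends on the set size and the filter's configuration.

```java
import java.util.BitSet;

public class SimpleBloomFilter<T> {
    private final BitSet bits;
    private final int numBits;    // m: size of the bit array
    private final int numHashes;  // k: bit positions set per element

    public SimpleBloomFilter(int numBits, int numHashes) {
        if (numBits <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("numBits and numHashes must be positive");
        }
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Derives the i-th bit position from two base hashes (double hashing).
    // This mixing scheme is an assumption made for the sketch.
    private int index(T item, int i) {
        int h1 = item.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B1, 15) | 1;
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // Sets all k bit positions for the item.
    public void add(T item) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(item, i));
        }
    }

    // Returns false only if some position is unset, so an added item is never
    // reported missing. May return true for an item that was never added.
    public boolean contains(T item) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(item, i))) {
                return false;
            }
        }
        return true;
    }

    // Approximate false positive probability after numAdded distinct insertions.
    public double estimatedFalsePositiveRate(int numAdded) {
        double fractionSet = 1.0 - Math.exp(-(double) numHashes * numAdded / numBits);
        return Math.pow(fractionSet, numHashes);
    }
}
```

Unlike a Java Set, this object never stores the elements themselves, which is where the compactness comes from and why it cannot list or remove its members.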
