whose key is not in the set CustomerID415. This is sometimes called a semijoin, taking
the terminology from the database world.
Last but not least, what if the file CustomerID415 is still too big to fit in memory? Or
maybe CustomerID415 does fit in memory, but its size makes replicating it across all
the mappers inefficient? This situation calls for a data structure called a Bloom filter. A
Bloom filter is a compact representation of a set that supports only the contains query
(“Does this set contain this element?”). Furthermore, the query answer is not completely
accurate, but it's guaranteed to have no false negatives and a small probability of false
positives. The slight inaccuracy is the trade-off for the data structure's compactness. By
using a Bloom filter representation of CustomerID415, the mappers will pass through
all customers in the 415 area code, because the filter has no false negatives. The data join
algorithm therefore remains correct. The Bloom filter will also pass a small portion of
customers not in the 415 area code to the reduce phase. This is fine because those will be
ignored in the reduce phase. We'll still have improved performance by dramatically
reducing the amount of traffic shuffled across the network. The use of Bloom filters is in
fact a standard technique for joining in distributed databases, and it's used in commercial
products such as Oracle 11g. We'll describe the Bloom filter and its other applications in
more detail in the next section.
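The filtering flow described above can be sketched in plain Java, outside of Hadoop. This is only an illustration under stated assumptions: the class SemijoinSketch and its two methods are hypothetical names, the approximate filter is modeled as a Predicate that may return true for keys outside the set (but never false for keys inside it), and the reduce-side matching is reduced to an exact key lookup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch, not Hadoop API code: each record is a String[] whose
// first element is the join key (for example, a customer ID).
class SemijoinSketch {

    // Map side: pass through only records whose key the approximate filter
    // accepts. A Bloom filter may accept a few keys that are not in the set,
    // but it never rejects a key that is in the set.
    static List<String[]> mapSideFilter(List<String[]> records,
                                        Predicate<String> mightContain) {
        List<String[]> passed = new ArrayList<>();
        for (String[] record : records) {
            if (mightContain.test(record[0])) {
                passed.add(record);
            }
        }
        return passed;
    }

    // Reduce side: an exact membership check stands in for the join itself,
    // so any false positives that slipped through the map side are dropped.
    static List<String[]> reduceSideJoin(List<String[]> shuffled,
                                         Set<String> exactKeys) {
        List<String[]> joined = new ArrayList<>();
        for (String[] record : shuffled) {
            if (exactKeys.contains(record[0])) {
                joined.add(record);
            }
        }
        return joined;
    }
}
```

The point of the two stages is that a false positive costs only some extra shuffle traffic, whereas a false negative would silently lose a record. That is why a filter with no false negatives keeps the join correct.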
5.3 Creating a Bloom filter
If you use Hadoop for batch processing of large data sets, your data-intensive
computing needs probably include transaction-style processing as well. We won't cover all
the techniques for running real-time distributed data processing (caching, sharding,
etc.). They aren't necessarily Hadoop-related and are well beyond the scope of this
book. One lesser-known tool for real-time data processing is the Bloom filter, which
is a summary of a data set whose usage makes other data processing techniques more
efficient. When that data set is big, Hadoop is often called in to generate the Bloom
filter representation. As we mentioned earlier, a Bloom filter is also sometimes used for
data joining within Hadoop itself. As a data processing expert, you'll be well rewarded
to have the Bloom filter in your bag of tricks. In this section we'll explain this data
structure in more detail and we'll go through an online ad network example that will
build a Bloom filter using Hadoop.
5.3.1 What does a Bloom filter do?
At its most basic, a Bloom filter object supports two methods: add() and contains().
These two methods work much as they do in the Java Set interface. The method add()
adds an object to the set, and the method contains() returns a Boolean true/false
value denoting whether an object is in the set or not. But, for a Bloom filter, contains()
doesn't always give an accurate answer. It has no false negatives: if contains() returns
false, you can be sure that the set doesn't have the object queried. It does have a small
probability of false positives, though: contains() can return true for some objects not in
the set. The probability of false positives depends on the number of elements in the set
and some configuration parameters of the Bloom filter itself.
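To make the add()/contains() contract concrete, here is a minimal sketch of one possible Bloom filter in Java. It is not the implementation this book develops. The class name SimpleBloomFilter, the constructor parameters numBits and numHashes, and the way the bit positions are derived from hashCode() are all assumptions made for illustration.

```java
import java.util.BitSet;

// Hypothetical minimal Bloom filter: a bit array plus several hash positions
// per element. The parameters numBits and numHashes are illustrative choices.
class SimpleBloomFilter<E> {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Derive the i-th bit position from the object's hashCode() by combining
    // two hash values (an assumed scheme; any set of hash functions would do).
    private int index(E obj, int i) {
        int h1 = obj.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1;
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // Set every bit position associated with the object.
    void add(E obj) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(obj, i));
        }
    }

    // Return false if any associated bit is unset (the object was definitely
    // never added). Return true otherwise, which may be a false positive if
    // other elements happened to set the same bits.
    boolean contains(E obj) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(obj, i))) {
                return false;
            }
        }
        return true;
    }
}
```

In this sketch, add() only ever sets bits and nothing clears them, so an object that was added always finds all of its bits set. That is where the no-false-negatives guarantee comes from. False positives arise when bits set by other objects happen to cover all the positions of an object that was never added, which becomes more likely as more elements are added to a bit array of fixed size.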
 