Advanced MapReduce - Hadoop in Action

Databases Reference

In-Depth Information

( continued

)

product using Bloom filters is the open source distributed web proxy called Squid.

Squid caches frequently accessed web content to save bandwidth and give users a

faster web experience. In a cluster of Squid servers, each one can cache a different

set of content. An incoming request should be routed to the Squid server holding a

copy of the requested content, or in case of a cache miss, the request is passed on

to the originating server. The routing mechanism needs to know what each of the

Squid servers contains. As sending a list of URLs for each Squid server and storing

it in memory is expensive, Bloom filters are used. A false positive means a request

is forwarded to the wrong Squid server, but that server would recognize it as a cache

miss and pass it on to the originating server, ensuring the correctness of the overall

operation. The small performance hit from a false positive is far outweighed by the

overall improvement.

Sharding systems are a similar application but more advanced. In a nutshell,

database sharding is the partitioning of a database across multiple machines such

that each machine only has to deal with a subset of records. Each record has some

ID that determines which machine it's assigned to. In more basic designs, the ID

is hashed statically to one of a fixed number of database machines. This approach

is inflexible to adding more shards or rebalancing existing ones. To add flexibility, it

uses a dynamic look-up for each record ID, but unfortunately that adds processing

delay if the look-up is done through a database (i.e., using disk). Like Squid, more

advanced shard systems use in-memory Bloom filters as a fast look-up. It needs

some mechanism to handle false positives, but the occurrence is small enough to

not impact the overall performance improvement.

For online display ad networks, it's important to be able to target an ad from the right

category to a visitor. Given the volume of traffic a typical ad network receives and the

latency requirements, one can end up spending a lot of money on hardware to have

the capability of retrieving the category in real time. A design based on Bloom filters

can dramatically decrease that cost. Use an offline process to tag web pages (or

visitors) on a limited number of categories (sports, family-oriented, music, etc.). Build

a Bloom filter for each category and store it in memory at the ad servers. When an

ad request arrives, the ad servers can quickly and cheaply determine which category

of ads to show. The amount of false positives is negligible.

5.3.2

Implementing a Bloom filter

Conceptually the implementation of a Bloom filter is quite straightforward. We describe

its implementation in a single system first before implementing it using Hadoop in a

distributed way. The internal representation of a Bloom filter is a bit array of size m . We

have k independent hash functions, where each hash function takes an object as input

and outputs an integer between 0 and m -1. We use the integer output as an index into

the bit array. When we “add” an element to the Bloom filter, we use the hash functions

to generate k indexes into the bit array. We set the k bits to 1. Figure 5.3 shows what

happens when we add several objects ( x , y , and z ) over time, in a Bloom filter that uses

three hash functions. Note that a bit will be set to 1 regardless of its previous state. The

number of 1s in the bit array can only grow.

Hadoop in Action

Search WWH ::

Custom Search

Home