Databases Reference
In-Depth Information
( continued
)
product using Bloom filters is the open source distributed web proxy called Squid.
Squid caches frequently accessed web content to save bandwidth and give users a
faster web experience. In a cluster of Squid servers, each one can cache a different
set of content. An incoming request should be routed to the Squid server holding a
copy of the requested content, or in case of a cache miss, the request is passed on
to the originating server. The routing mechanism needs to know what each of the
Squid servers contains. As sending a list of URLs for each Squid server and storing
it in memory is expensive, Bloom filters are used. A false positive means a request
is forwarded to the wrong Squid server, but that server would recognize it as a cache
miss and pass it on to the originating server, ensuring the correctness of the overall
operation. The small performance hit from a false positive is far outweighed by the
overall improvement.
Sharding systems are a similar application but more advanced. In a nutshell,
database sharding is the partitioning of a database across multiple machines such
that each machine only has to deal with a subset of records. Each record has some
ID that determines which machine it's assigned to. In more basic designs, the ID
is hashed statically to one of a fixed number of database machines. This approach
is inflexible to adding more shards or rebalancing existing ones. To add flexibility, it
uses a dynamic look-up for each record ID, but unfortunately that adds processing
delay if the look-up is done through a database (i.e., using disk). Like Squid, more
advanced shard systems use in-memory Bloom filters as a fast look-up. It needs
some mechanism to handle false positives, but the occurrence is small enough to
not impact the overall performance improvement.
For online display ad networks, it's important to be able to target an ad from the right
category to a visitor. Given the volume of traffic a typical ad network receives and the
latency requirements, one can end up spending a lot of money on hardware to have
the capability of retrieving the category in real time. A design based on Bloom filters
can dramatically decrease that cost. Use an offline process to tag web pages (or
visitors) on a limited number of categories (sports, family-oriented, music, etc.). Build
a Bloom filter for each category and store it in memory at the ad servers. When an
ad request arrives, the ad servers can quickly and cheaply determine which category
of ads to show. The amount of false positives is negligible.
5.3.2
Implementing a Bloom filter
Conceptually the implementation of a Bloom filter is quite straightforward. We describe
its implementation in a single system first before implementing it using Hadoop in a
distributed way. The internal representation of a Bloom filter is a bit array of size m . We
have k independent hash functions, where each hash function takes an object as input
and outputs an integer between 0 and m -1. We use the integer output as an index into
the bit array. When we “add” an element to the Bloom filter, we use the hash functions
to generate k indexes into the bit array. We set the k bits to 1. Figure 5.3 shows what
happens when we add several objects ( x , y , and z ) over time, in a Bloom filter that uses
three hash functions. Note that a bit will be set to 1 regardless of its previous state. The
number of 1s in the bit array can only grow.
 
 
Search WWH ::




Custom Search