Database Reference
In-Depth Information
email address hashes is 1, then we let the email through. But if the email address hashes to
a 0, we are certain that the address is not in S , so we can drop this stream element.
Unfortunately, some spam email will get through. Approximately 1/8th of the stream
elements whose email address is not in S will happen to hash to a bit whose value is 1 and
will be let through. Nevertheless, since the majority of emails are spam (about 80% accord-
ing to some reports), eliminating 7/8th of the spam is a significant benefit. Moreover, if we
want to eliminate every spam, we need only check for membership in S those good and bad
emails that get through the filter. Those checks will require the use of secondary memory
to access S itself. There are also other options, as we shall see when we study the general
Bloom-filtering technique. As a simple example, we could use a cascade of filters, each of
which would eliminate 7/8th of the remaining spam.
4.3.2
The Bloom Filter
A Bloom filter consists of:
(1) An array of n bits, initially all 0s.
(2) A collection of hash functions h 1 , h 2 , . . . , h k . Each hash function maps “key” values
to n buckets, corresponding to the n bits of the bit-array.
(3) A set S of m key values.
The purpose of the Bloom filter is to allow through all stream elements whose keys are in
S , while rejecting most of the stream elements whose keys are not in S .
To initialize the bit array, begin with all bits 0. Take each key value in S and hash it using
each of the k hash functions. Set to 1 each bit that is h i ( K ) for some hash function h i and
some key value K in S .
To test a key K that arrives in the stream, check that all of
h 1 ( K ) , h 2 ( K ) , . . . , h k ( K )
are 1s in the bit-array. If all are 1s, then let the stream element through. If one or more of
these bits are 0, then K could not be in S , so reject the stream element.
4.3.3
Analysis of Bloom Filtering
If a key value is in S , then the element will surely pass through the Bloom filter. However,
if the key value is not in S , it might still pass. We need to understand how to calculate
Search WWH ::




Custom Search