Database Reference
In-Depth Information
email address hashes is 1, then we let the email through. But if the email address hashes to
a 0, we are certain that the address is not in
S
, so we can drop this stream element.
Unfortunately, some spam email will get through. Approximately 1/8th of the stream
elements whose email address is not in
S
will happen to hash to a bit whose value is 1 and
will be let through. Nevertheless, since the majority of emails are spam (about 80% accord-
ing to some reports), eliminating 7/8th of the spam is a significant benefit. Moreover, if we
want to eliminate every spam, we need only check for membership in
S
those good and bad
emails that get through the filter. Those checks will require the use of secondary memory
to access
S
itself. There are also other options, as we shall see when we study the general
Bloom-filtering technique. As a simple example, we could use a cascade of filters, each of
which would eliminate 7/8th of the remaining spam.
4.3.2
The Bloom Filter
A
Bloom filter
consists of:
(1) An array of
n
bits, initially all 0s.
(2) A collection of hash functions
h
1
, h
2
, . . . , h
k
. Each hash function maps “key” values
to
n
buckets, corresponding to the
n
bits of the bit-array.
(3) A set
S
of
m
key values.
The purpose of the Bloom filter is to allow through all stream elements whose keys are in
S
, while rejecting most of the stream elements whose keys are not in
S
.
To initialize the bit array, begin with all bits 0. Take each key value in
S
and hash it using
each of the
k
hash functions. Set to 1 each bit that is
h
i
(
K
) for some hash function
h
i
and
some key value
K
in
S
.
To test a key
K
that arrives in the stream, check that all of
h
1
(
K
)
, h
2
(
K
)
, . . . , h
k
(
K
)
are 1s in the bit-array. If all are 1s, then let the stream element through. If one or more of
these bits are 0, then
K
could not be in
S
, so reject the stream element.
4.3.3
Analysis of Bloom Filtering
If a key value is in
S
, then the element will surely pass through the Bloom filter. However,
if the key value is not in
S
, it might still pass. We need to understand how to calculate