Databases Reference
In-Depth Information
Example 4.3 : Consider the running example of Section 4.3.1. We can use
the above calculation to get the true expected number of 1's in the bit array.
Think of each bit as a target, and each member of S as a dart. Then the
probability that a given bit will be 1 is the probability that the corresponding
target will be hit by one or more darts. Since there are one billion members of
S, we have y = 10 9 darts. As there are eight billion bits, there are x = 8×10 9
targets. Thus, the probability that a given target is not hit is e −y/x = e −1/8
and the probability that it is hit is 1−e −1/8 . That quantity is about 0.1175.
In Section 4.3.1 we suggested that 1/8 = 0.125 is a good approximation, which
it is, but now we have the exact calculation.
2
We can apply the rule to the more general situation, where set S has m
members, the array has n bits, and there are k hash functions. The number
of targets is x = n, and the number of darts is y = km. Thus, the probability
that a bit remains 0 is e −km/n . We want the fraction of 0 bits to be fairly
large, or else the probability that a nonmember of S will hash at least once to
a 0 becomes too small, and there are too many false positives. For example,
we might choose k, the number of hash functions to be n/m or less. Then the
probability of a 0 is at least e −1 or 37%. In general, the probability of a false
positive is the probability of a 1 bit, which is 1−e −km/n , raised to the kth
power, i.e., (1−e −km/n ) k .
Example 4.4 : In Example 4.3 we found that the fraction of 1's in the array of
our running example is 0.1175, and this fraction is also the probability of a false
positive. That is, a nonmember of S will pass through the filter if it hashes to
a 1, and the probability of it doing so is 0.1175.
Suppose we used the same S and the same array, but used two different
hash functions. This situation corresponds to throwing two billion darts at
eight billion targets, and the probability that a bit remains 0 is e −1/4 . In order
to be a false positive, a nonmember of S must hash twice to bits that are 1,
and this probability is (1−e −1/4 ) 2 , or approximately 0.0493. Thus, adding a
second hash function for our running example is an improvement, reducing the
false-positive rate from 0.1175 to 0.0493.
2
4.3.4
Exercises for Section 4.3
Exercise 4.3.1 : For the situation of our running example (8 billion bits, 1
billion members of the set S), calculate the false-positive rate if we use three
hash functions? What if we use four hash functions?
! Exercise 4.3.2 : Suppose we have n bits of memory available, and our set S
has m members. Instead of using k hash functions, we could divide the n bits
into k arrays, and hash once to each array. As a function of n, m, and k, what
is the probability of a false positive? How does it compare with using k hash
functions into a single array?
Search WWH ::




Custom Search