Mining Data Streams - Mining of Massive Datasets

Databases Reference

In-Depth Information

Example 4.3 : Consider the running example of Section 4.3.1. We can use

the above calculation to get the true expected number of 1's in the bit array.

Think of each bit as a target, and each member of S as a dart. Then the

probability that a given bit will be 1 is the probability that the corresponding

target will be hit by one or more darts. Since there are one billion members of

S, we have y = 10 9 darts. As there are eight billion bits, there are x = 8×10 9

targets. Thus, the probability that a given target is not hit is e −y/x = e −1/8

and the probability that it is hit is 1−e −1/8 . That quantity is about 0.1175.

In Section 4.3.1 we suggested that 1/8 = 0.125 is a good approximation, which

it is, but now we have the exact calculation.

2

We can apply the rule to the more general situation, where set S has m

members, the array has n bits, and there are k hash functions. The number

of targets is x = n, and the number of darts is y = km. Thus, the probability

that a bit remains 0 is e −km/n . We want the fraction of 0 bits to be fairly

large, or else the probability that a nonmember of S will hash at least once to

a 0 becomes too small, and there are too many false positives. For example,

we might choose k, the number of hash functions to be n/m or less. Then the

probability of a 0 is at least e −1 or 37%. In general, the probability of a false

positive is the probability of a 1 bit, which is 1−e −km/n , raised to the kth

power, i.e., (1−e −km/n ) k .

Example 4.4 : In Example 4.3 we found that the fraction of 1's in the array of

our running example is 0.1175, and this fraction is also the probability of a false

positive. That is, a nonmember of S will pass through the filter if it hashes to

a 1, and the probability of it doing so is 0.1175.

Suppose we used the same S and the same array, but used two different

hash functions. This situation corresponds to throwing two billion darts at

eight billion targets, and the probability that a bit remains 0 is e −1/4 . In order

to be a false positive, a nonmember of S must hash twice to bits that are 1,

and this probability is (1−e −1/4 ) 2 , or approximately 0.0493. Thus, adding a

second hash function for our running example is an improvement, reducing the

false-positive rate from 0.1175 to 0.0493.

2

4.3.4

Exercises for Section 4.3

Exercise 4.3.1 : For the situation of our running example (8 billion bits, 1

billion members of the set S), calculate the false-positive rate if we use three

hash functions? What if we use four hash functions?

! Exercise 4.3.2 : Suppose we have n bits of memory available, and our set S

has m members. Instead of using k hash functions, we could divide the n bits

into k arrays, and hash once to each array. As a function of n, m, and k, what

is the probability of a false positive? How does it compare with using k hash

functions into a single array?

Mining of Massive Datasets

Search WWH ::

Custom Search

Home