Database Reference
In-Depth Information
Whenever we apply a hash function h to a stream element a , the bit string h ( a ) will end
in some number of 0s, possibly none. Call this number the tail length for a and h . Let R be
the maximum tail length of any a seen so far in the stream. Then we shall use estimate 2 R
for the number of distinct elements seen in the stream.
This estimate makes intuitive sense. The probability that a given stream element a has
h ( a ) ending in at least r 0s is 2 −r . Suppose there are m distinct elements in the stream. Then
the probability that none of them has tail length at least r is (1 − 2 −r ) m . This sort of ex-
pression should be familiar by now. We can rewrite it as ((1 − 2 −r ) 2 r ) m 2 −r . Assuming r is
reasonably large, the inner expression is of the form (1 − ϵ ) 1/ ϵ , which is approximately 1/ e .
Thus, the probability of not finding a stream element with as many as r 0s at the end of its
hash value is e −m 2 −r . We can conclude:
(1) If m is much larger than 2 r , then the probability that we shall find a tail of length at
least r approaches 1.
(2) If m is much less than 2 r , then the probability of finding a tail length at least r ap-
proaches 0.
We conclude from these two points that the proposed estimate of m , which is 2 R (recall R
is the largest tail length for any stream element) is unlikely to be either much too high or
much too low.
4.4.3
Combining Estimates
Unfortunately, there is a trap regarding the strategy for combining the estimates of m , the
number of distinct elements, that we obtain by using many different hash functions. Our
first assumption would be that if we take the average of the values 2 R that we get from each
hash function, we shall get a value that approaches the true m , the more hash functions we
use. However, that is not the case, and the reason has to do with the influence an overes-
timate has on the average.
Consider a value of r such that 2 r is much larger than m . There is some probability p that
we shall discover r to be the largest number of 0s at the end of the hash value for any of
the m stream elements. Then the probability of finding r + 1 to be the largest number of 0s
instead is at least p /2. However, if we do increase by 1 the number of 0s at the end of a hash
value, the value of 2 R doubles. Consequently, the contribution from each possible large R
to the expected value of 2 R grows as R grows, and the expected value of 2 R is actually in-
finite. 3
Search WWH ::




Custom Search