Database Reference
In-Depth Information
Whenever we apply a hash function
h
to a stream element
a
, the bit string
h
(
a
) will end
in some number of 0s, possibly none. Call this number the
tail length
for
a
and
h
. Let
R
be
the maximum tail length of any
a
seen so far in the stream. Then we shall use estimate 2
R
for the number of distinct elements seen in the stream.
This estimate makes intuitive sense. The probability that a given stream element
a
has
h
(
a
) ending in at least
r
0s is 2
−r
. Suppose there are
m
distinct elements in the stream. Then
the probability that none of them has tail length at least
r
is (1 − 2
−r
)
m
. This sort of ex-
pression should be familiar by now. We can rewrite it as ((1 − 2
−r
)
2
r
)
m
2
−r
. Assuming
r
is
reasonably large, the inner expression is of the form (1 −
ϵ
)
1/
ϵ
, which is approximately 1/
e
.
Thus, the probability of not finding a stream element with as many as
r
0s at the end of its
hash value is
e
−m
2
−r
. We can conclude:
(1) If
m
is much larger than 2
r
, then the probability that we shall find a tail of length at
least
r
approaches 1.
(2) If
m
is much less than 2
r
, then the probability of finding a tail length at least
r
ap-
proaches 0.
We conclude from these two points that the proposed estimate of
m
, which is 2
R
(recall
R
is the largest tail length for any stream element) is unlikely to be either much too high or
much too low.
4.4.3
Combining Estimates
Unfortunately, there is a trap regarding the strategy for combining the estimates of
m
, the
number of distinct elements, that we obtain by using many different hash functions. Our
first assumption would be that if we take the average of the values 2
R
that we get from each
hash function, we shall get a value that approaches the true
m
, the more hash functions we
use. However, that is not the case, and the reason has to do with the influence an overes-
timate has on the average.
Consider a value of
r
such that 2
r
is much larger than
m
. There is some probability
p
that
we shall discover
r
to be the largest number of 0s at the end of the hash value for any of
the
m
stream elements. Then the probability of finding
r
+ 1 to be the largest number of 0s
instead is at least
p
/2. However, if we do increase by 1 the number of 0s at the end of a hash
value, the value of 2
R
doubles. Consequently, the contribution from each possible large
R
to the expected value of 2
R
grows as
R
grows, and the expected value of 2
R
is actually in-
finite.
3