Database Reference
In-Depth Information
Another way to combine estimates is to take the median of all estimates. The median is
not affected by the occasional outsized value of 2 R , so the worry described above for the av-
erage should not carry over to the median. Unfortunately, the median suffers from another
defect: it is always a power of 2. Thus, no matter how many hash functions we use, should
the correct value of m be between two powers of 2, say 400, then it will be impossible to
obtain a close estimate.
There is a solution to the problem, however. We can combine the two methods. First,
group the hash functions into small groups, and take their average. Then, take the median of
the averages. It is true that an occasional outsized 2 R will bias some of the groups and make
them too large. However, taking the median of group averages will reduce the influence of
this effect almost to nothing. Moreover, if the groups themselves are large enough, then the
averages can be essentially any number, which enables us to approach the true value m as
long as we use enough hash functions. In order to guarantee that any possible average can
be obtained, groups should be of size at least a small multiple of log 2 m .
4.4.4
Space Requirements
Observe that as we read the stream it is not necessary to store the elements seen. The only
thing we need to keep in main memory is one integer per hash function; this integer re-
cords the largest tail length seen so far for that hash function and any stream element. If we
are processing only one stream, we could use millions of hash functions, which is far more
than we need to get a close estimate. Only if we are trying to process many streams at the
same time would main memory constrain the number of hash functions we could associate
with any one stream. In practice, the time it takes to compute hash values for each stream
element would be the more significant limitation on the number of hash functions we use.
4.4.5
Exercises for Section 4.4
EXERCISE 4.4.1 Suppose our stream consists of the integers 3, 1, 4, 1, 5, 9, 2, 6, 5. Our hash
functions will all be of the form h ( x ) = ax + b mod 32 for some a and b . You should treat
the result as a 5-bit binary integer. Determine the tail length for each stream element and
the resulting estimate of the number of distinct elements if the hash function is:
(a) h ( x ) = 2 x + 1 mod 32.
(b) h ( x ) = 3 x + 7 mod 32.
(c) h ( x ) = 4 x mod 32.
Search WWH ::




Custom Search