Database Reference
In-Depth Information
measure is appropriate for a site like Amazon, where the typical user logs in with their
unique login name.
A similar problem is a Web site like Google that does not require login to issue a search
query, and may be able to identify users only by the IP address from which they send the
query. There are about 4 billion IP addresses, 2 sequences of four 8-bit bytes will serve as
the universal set in this case.
The obvious way to solve the problem is to keep in main memory a list of all the ele-
ments seen so far in the stream. Keep them in an efficient search structure such as a hash
table or search tree, so one can quickly add new elements and check whether or not the
element that just arrived on the stream was already seen. As long as the number of distinct
elements is not too great, this structure can fit in main memory and there is little problem
obtaining an exact answer to the question how many distinct elements appear in the stream.
However, if the number of distinct elements is too great, or if there are too many streams
that need to be processed at once (e.g., Yahoo! wants to count the number of unique users
viewing each of its pages in a month), then we cannot store the needed data in main
memory. There are several options. We could use more machines, each machine handling
only one or several of the streams. We could store most of the data structure in secondary
memory and batch stream elements so whenever we brought a disk block to main memory
there would be many tests and updates to be performed on the data in that block. Or we
could use the strategy to be discussed in this section, where we only estimate the number
of distinct elements but use much less memory than the number of distinct elements.
4.4.2
The Flajolet-Martin Algorithm
It is possible to estimate the number of distinct elements by hashing the elements of the
universal set to a bit-string that is sufficiently long. The length of the bit-string must be
sufficient that there are more possible results of the hash function than there are elements
of the universal set. For example, 64 bits is sufficient to hash URL's. We shall pick many
different hash functions and hash each element of the stream using these hash functions.
The important property of a hash function is that when applied to the same element, it al-
ways produces the same result. Notice that this property was also essential for the sampling
technique of Section 4.2 .
The idea behind the Flajolet-Martin Algorithm is that the more different elements we
see in the stream, the more different hash-values we shall see. As we see more different
hash-values, it becomes more likely that one of these values will be “unusual.” The par-
ticular unusual property we shall exploit is that the value ends in many 0s, although many
other options exist.
Search WWH ::




Custom Search