Mining Data Streams - Mining of Massive Datasets

Database Reference

In-Depth Information

measure is appropriate for a site like Amazon, where the typical user logs in with their

unique login name.

A similar problem is a Web site like Google that does not require login to issue a search

query, and may be able to identify users only by the IP address from which they send the

query. There are about 4 billion IP addresses, 2 sequences of four 8-bit bytes will serve as

the universal set in this case.

□

The obvious way to solve the problem is to keep in main memory a list of all the ele-

ments seen so far in the stream. Keep them in an efficient search structure such as a hash

table or search tree, so one can quickly add new elements and check whether or not the

element that just arrived on the stream was already seen. As long as the number of distinct

elements is not too great, this structure can fit in main memory and there is little problem

obtaining an exact answer to the question how many distinct elements appear in the stream.

However, if the number of distinct elements is too great, or if there are too many streams

that need to be processed at once (e.g., Yahoo! wants to count the number of unique users

viewing each of its pages in a month), then we cannot store the needed data in main

memory. There are several options. We could use more machines, each machine handling

only one or several of the streams. We could store most of the data structure in secondary

memory and batch stream elements so whenever we brought a disk block to main memory

there would be many tests and updates to be performed on the data in that block. Or we

could use the strategy to be discussed in this section, where we only estimate the number

of distinct elements but use much less memory than the number of distinct elements.

4.4.2

The Flajolet-Martin Algorithm

It is possible to estimate the number of distinct elements by hashing the elements of the

universal set to a bit-string that is sufficiently long. The length of the bit-string must be

sufficient that there are more possible results of the hash function than there are elements

of the universal set. For example, 64 bits is sufficient to hash URL's. We shall pick many

different hash functions and hash each element of the stream using these hash functions.

The important property of a hash function is that when applied to the same element, it al-

ways produces the same result. Notice that this property was also essential for the sampling

technique of Section 4.2 .

The idea behind the Flajolet-Martin Algorithm is that the more different elements we

see in the stream, the more different hash-values we shall see. As we see more different

hash-values, it becomes more likely that one of these values will be “unusual.” The par-

ticular unusual property we shall exploit is that the value ends in many 0s, although many

other options exist.

Search WWH ::

Custom Search

Home