Mining Data Streams - Mining of Massive Datasets

Database Reference

In-Depth Information

EXAMPLE 4.2 Web sites often like to report the number of unique users over the past month.

If we think of each login as a stream element, we can maintain a window that is all lo-

gins in the most recent month. We must associate the arrival time with each login, so we

know when it no longer belongs to the window. If we think of the window as a relation

Logins(name, time) , then it is simple to get the number of unique users over the past

month. The SQL query is:

SELECT COUNT(DISTINCT(name))

FROM Logins

WHERE time >= t;

Here, t is a constant that represents the time one month before the current time.

Note that we must be able to maintain the entire stream of logins for the past month in

working storage. However, for even the largest sites, that data is not more than a few tera-

bytes, and so surely can be stored on disk.

□

4.1.4

Issues in Stream Processing

Before proceeding to discuss algorithms, let us consider the constraints under which we

work when dealing with streams. First, streams often deliver elements very rapidly. We

must process elements in real time, or we lose the opportunity to process them at all,

without accessing the archival storage. Thus, it often is important that the stream-process-

ing algorithm is executed in main memory, without access to secondary storage or with

only rare accesses to secondary storage. Moreover, even when streams are “slow,” as in the

sensor-data example of Section 4.1.2 , there may be many such streams. Even if each stream

by itself can be processed using a small amount of main memory, the requirements of all

the streams together can easily exceed the amount of available main memory.

Thus, many problems about streaming data would be easy to solve if we had enough

memory, but become rather hard and require the invention of new techniques in order to

execute them at a realistic rate on a machine of realistic size. Here are two generalizations

about stream algorithms worth bearing in mind as you read through this chapter:

• Often, it is much more efficient to get an approximate answer to our problem than

an exact solution.

• As in Chapter 3 , a variety of techniques related to hashing turn out to be useful.

Generally, these techniques introduce useful randomness into the algorithm's be-

havior, in order to produce an approximate answer that is very close to the true

result.

Search WWH ::

Custom Search

Home