Database Reference
In-Depth Information
EXAMPLE 4.2 Web sites often like to report the number of unique users over the past month.
If we think of each login as a stream element, we can maintain a window that is all lo-
gins in the most recent month. We must associate the arrival time with each login, so we
know when it no longer belongs to the window. If we think of the window as a relation
Logins(name, time) , then it is simple to get the number of unique users over the past
month. The SQL query is:
SELECT COUNT(DISTINCT(name))
FROM Logins
WHERE time >= t;
Here, t is a constant that represents the time one month before the current time.
Note that we must be able to maintain the entire stream of logins for the past month in
working storage. However, for even the largest sites, that data is not more than a few tera-
bytes, and so surely can be stored on disk.
4.1.4
Issues in Stream Processing
Before proceeding to discuss algorithms, let us consider the constraints under which we
work when dealing with streams. First, streams often deliver elements very rapidly. We
must process elements in real time, or we lose the opportunity to process them at all,
without accessing the archival storage. Thus, it often is important that the stream-process-
ing algorithm is executed in main memory, without access to secondary storage or with
only rare accesses to secondary storage. Moreover, even when streams are “slow,” as in the
sensor-data example of Section 4.1.2 , there may be many such streams. Even if each stream
by itself can be processed using a small amount of main memory, the requirements of all
the streams together can easily exceed the amount of available main memory.
Thus, many problems about streaming data would be easy to solve if we had enough
memory, but become rather hard and require the invention of new techniques in order to
execute them at a realistic rate on a machine of realistic size. Here are two generalizations
about stream algorithms worth bearing in mind as you read through this chapter:
• Often, it is much more efficient to get an approximate answer to our problem than
an exact solution.
• As in Chapter 3 , a variety of techniques related to hashing turn out to be useful.
Generally, these techniques introduce useful randomness into the algorithm's be-
havior, in order to produce an approximate answer that is very close to the true
result.
Search WWH ::




Custom Search