Database Reference
In-Depth Information
weight on the current element is (1 − c ) 0 = 1, so adding a t +1 is the correct way to include
the new element's contribution.
4.7.3
Finding the Most Popular Elements
Let us return to the problem of finding the most popular movies in a stream of ticket sales. 6
We shall use an exponentially decaying window with a constant c , which you might think
of as 10 9 . That is, we approximate a sliding window holding the last one billion ticket
sales. For each movie, we imagine a separate stream with a 1 each time a ticket for that
movie appears in the stream, and a 0 each time a ticket for some other movie arrives. The
decaying sum of the 1s measures the current popularity of the movie.
We imagine that the number of possible movies in the stream is huge, so we do not want
to record values for the unpopular movies. Therefore, we establish a threshold, say 1/2, so
that if the popularity score for a movie goes below this number, its score is dropped from
the counting. For reasons that will become obvious, the threshold must be less than 1, al-
though it can be any number less than 1. When a new ticket arrives on the stream, do the
following:
(1) For each movie whose score we are currently maintaining, multiply its score by (1 −
c ).
(2) Suppose the new ticket is for movie M . If there is currently a score for M , add 1 to that
score. If there is no score for M , create one and initialize it to 1.
(3) If any score is below the threshold 1/2, drop that score.
It may not be obvious that the number of movies whose scores are maintained at any
time is limited. However, note that the sum of all scores is 1/ c . There cannot be more than
2/ c movies with score of 1/2 or more, or else the sum of the scores would exceed 1/ c . Thus,
2/ c is a limit on the number of movies being counted at any time. Of course in practice, the
ticket sales would be concentrated on only a small number of movies at any time, so the
number of actively counted movies would be much less than 2/ c .
4.8 Summary of Chapter 4
The Stream Data Model : This model assumes data arrives at a processing engine at a rate that makes it infeasible
to store everything in active storage. One strategy to dealing with streams is to maintain summaries of the streams,
sufficient to answer the expected queries about the data. A second approach is to maintain a sliding window of the
most recently arrived data.
Search WWH ::




Custom Search