Databases Reference
In-Depth Information
2. Add a t+1 .
The reason this method works is that each of the previous elements has now
moved one position further from the current element, so its weight is multiplied
by 1−c. Further, the weight on the current element is (1−c) 0 = 1, so adding
a t+1 is the correct way to include the new element's contribution.
4.7.3
Finding the Most Popular Elements
Let us return to the problem of finding the most popular movies in a stream of
ticket sales. 5 We shall use an exponentially decaying window with a constant
c, which you might think of as 10 −9 . That is, we approximate a sliding window
holding the last one billion ticket sales. For each movie, we imagine a separate
stream with a 1 each time a ticket for that movie appears in the stream, and a
0 each time a ticket for some other movie arrives. The decaying sum of the 1's
measures the current popularity of the movie.
We imagine that the number of possible movies in the stream is huge, so we
do not want to record values for the unpopular movies. Therefore, we establish
a threshold, say 1/2, so that if the popularity score for a movie goes below this
number, its score is dropped from the counting. For reasons that will become
obvious, the threshold must be less than 1, although it can be any number less
than 1. When a new ticket arrives on the stream, do the following:
1. For each movie whose score we are currently maintaining, multiply its
score by (1−c).
2. Suppose the new ticket is for movie M . If there is currently a score for M ,
add 1 to that score. If there is no score for M , create one and initialize it
to 1.
3. If any score is below the threshold 1/2, drop that score.
It may not be obvious that the number of movies whose scores are main-
tained at any time is limited. However, note that the sum of all scores is 1/c.
There cannot be more than 2/c movies with score of 1/2 or more, or else the
sum of the scores would exceed 1/c. Thus, 2/c is a limit on the number of
movies being counted at any time. Of course in practice, the ticket sales would
be concentrated on only a small number of movies at any time, so the number
of actively counted movies would be much less than 2/c.
5 This example should be taken with a grain of salt, because, as we pointed out, there
aren't enough different movies for this technique to be essential. Imagine, if you will, that
the number of movies is extremely large, so counting ticket sales of each one separately is not
feasible.
Search WWH ::




Custom Search