Database Reference
In-Depth Information
4.5.4
Higher-Order Moments
We estimate k th moments, for k > 2, in essentially the same way as we estimate second
moments. The only thing that changes is the way we derive an estimate from a variable. In
Section 4.5.2 we used the formula n × (2 v − 1) to turn a value v , the count of the number of
occurrences of some particular stream element a , into an estimate of the second moment.
Then, in Section 4.5.3 we saw why this formula works: the terms 2 v − 1, for v = 1 , 2 , . . . ,
m sum to m 2 , where m is the number of times a appears in the stream.
Notice that 2 v −1 is the difference between v 2 and ( v −1) 2 . Suppose we wanted the third
moment rather than the second. Then all we have to do is replace 2 v −1 by v 3 −( v −1) 3 = 3 v 2
−3 v +1. Then −3 v +1 = m 3 , so we can use as our estimate of the third moment the formula
n × (3 v 2 − 3 v + 1), where v = X.value is the value associated with some variable X . More
generally, we can estimate k th moments for any k ≥ 2 by turning value v = X.value into n ×
( v k −( v −1) k ).
4.5.5
Dealing With Infinite Streams
Technically, the estimate we used for second and higher moments assumes that n , the
stream length, is a constant. In practice, n grows with time. That fact, by itself, doesn't
cause problems, since we store only the values of variables and multiply some function of
that value by n when it is time to estimate the moment. If we count the number of stream
elements seen and store this value, which only requires log n bits, then we have n available
whenever we need it.
A more serious problem is that we must be careful how we select the positions for the
variables. If we do this selection once and for all, then as the stream gets longer, we are
biased in favor of early positions, and the estimate of the moment will be too large. On the
other hand, if we wait too long to pick positions, then early in the stream we do not have
many variables and so will get an unreliable estimate.
The proper technique is to maintain as many variables as we can store at all times, and
to throw some out as the stream grows. The discarded variables are replaced by new ones,
in such a way that at all times, the probability of picking any one position for a variable is
the same as that of picking any other position. Suppose we have space to store s variables.
Then the first s positions of the stream are each picked as the position of one of the s vari-
ables.
Inductively, suppose we have seen n stream elements, and the probability of any partic-
ular position being the position of a variable is uniform, that is s / n . When the ( n +1)st ele-
ment arrives, pick that position with probability s /( n +1). If not picked, then the s variables
Search WWH ::




Custom Search