Databases Reference
In-Depth Information
However, to make sense of the formula, we need to change the order of
summation by grouping all those positions that have the same element. For
instance, concentrate on some element a that appears m a times in the stream.
The term for the last position in which a appears must be 2×1−1 = 1. The
term for the next-to-last position in which a appears is 2×2−1 = 3.
The
positions with a before that yield terms 5, 7, and so on, up to 2m a
−1, which
is the term for the first position in which a appears. That is, the formula for
the expected value of X.value can be written:
E(X.value) =
1 + 3 + 5 ++ (2m a
−1)
a
−1) = (m a ) 2 . The proof is an easy induction
on the number of terms in the sum. Thus, E(X.value) =
Note that 1 + 3 + 5 ++ (2m a
a (m a ) 2 , which is
the definition of the second moment.
4.5.4
Higher-Order Moments
We estimate kth moments, for k > 2, in essentially the same way as we estimate
second moments. The only thing that changes is the way we derive an estimate
from a variable. In Section 4.5.2 we used the formula n×(2v−1) to turn a value
v, the count of the number of occurrences of some particular stream element
a, into an estimate of the second moment. Then, in Section 4.5.3 we saw why
this formula works: the terms 2v−1, for v = 1, 2, . . . , m sum to m 2 , where m
is the number of times a appears in the stream.
Notice that 2v−1 is the difference between v 2 and (v−1) 2 . Suppose we
wanted the third moment rather than the second. Then all we have to do is
replace 2v−1 by v 3 −(v−1) 3 = 3v 2 −3v+1. Then
m
v=1 3v 2 −3v+1 = m 3 , so we
can use as our estimate of the third moment the formula n×(3v 2 −3v+1), where
v = X.value is the value associated with some variable X. More generally, we
can estimate kth moments for any k≥2 by turning value v = X.value into
v k −(v−1) k
.
4.5.5
Dealing With Infinite Streams
Technically, the estimate we used for second and higher moments assumes that
n, the stream length, is a constant. In practice, n grows with time. That fact,
by itself, doesn't cause problems, since we store only the values of variables
and multiply some function of that value by n when it is time to estimate the
moment. If we count the number of stream elements seen and store this value,
which only requires log n bits, then we have n available whenever we need it.
A more serious problem is that we must be careful how we select the positions
for the variables. If we do this selection once and for all, then as the stream gets
longer, we are biased in favor of early positions, and the estimate of the moment
will be too large. On the other hand, if we wait too long to pick positions, then
Search WWH ::




Custom Search