Database Reference
In-Depth Information
random number generator approach and the sketch representation are
as follows:
A given component r i can be generated in poly-logarithmic time
from the seed. The time for generating the seed is poly-logarithmic
in the domain size of the underlying data.
A variety of approximate aggregate functions on the original data
can be computed using the sketches.
Some example of functions which can be computed from the sketch com-
ponents are as follows:
Dot Product of two streams: If ( s 1 ...s k ) be the sketches from
one stream, and ( t 1 ...t k ) be the sketches from the other stream,
then s j cdott j is a random variable whose expected value of the dot
product.
Second Moment: If ( s 1 ...s k ) be the sketch components for a
data stream, it can be shown that the expected value of s j is the
second moment. Furthermore, by using Chernoff bounds, it can
be shown that by selecting the median of O (log(1 ) averages of
O (1 / 2 )copiesof s j ] cdott j , it is possible to guarantee the accuracy
of the approximation to within 1 + with probability at least 1
δ .
Frequent Items: The frequency of the i th item in the data stream
is computed by by multiplying the sketch component s j by r i .
However, this estimation is accurate only for the case of frequent
items, since the error is estimation is proportional to the overall
frequency of the items in the data stream.
More details of computations which one can perform with the AMS
sketch are discussed in [17, 18].
The second kind of sketch which is used for counting is the count-min
sketch [31]. The count-min sketch is based upon the concept of hashing,
and uses k =ln(1 ) pairwise-independent hash functions, which hash
onto integers in the range (0 ...e/ ). For each incoming item, the k hash
functions are applied and the frequency count is incremented by 1. In
the event that the incoming item has frequency f , then the correspond-
ing frequency count is incremented by f . Note that by hashing an item
into the k cells, we are ensuring that we maintain an overestimate on the
corresponding frequency. It can be shown that the minimum of these
cells provides the -accurate estimate to the frequency with probability
at least 1
δ . It has been shown in [31] that the method can also be
naturally extended to other problems such as finding the dot product
Search WWH ::




Custom Search