MINING SENSOR DATA STREAMS - Managing and Mining Sensor Data

Database Reference

In-Depth Information

random number generator approach and the sketch representation are

as follows:

A given component r i can be generated in poly-logarithmic time

from the seed. The time for generating the seed is poly-logarithmic

in the domain size of the underlying data.

A variety of approximate aggregate functions on the original data

can be computed using the sketches.

Some example of functions which can be computed from the sketch com-

ponents are as follows:

Dot Product of two streams: If ( s 1 ...s k ) be the sketches from

one stream, and ( t 1 ...t k ) be the sketches from the other stream,

then s j cdott j is a random variable whose expected value of the dot

product.

Second Moment: If ( s 1 ...s k ) be the sketch components for a

data stream, it can be shown that the expected value of s j is the

second moment. Furthermore, by using Chernoff bounds, it can

be shown that by selecting the median of O (log(1 /δ ) averages of

O (1 / 2 )copiesof s j ] cdott j , it is possible to guarantee the accuracy

of the approximation to within 1 + with probability at least 1

−

δ .

Frequent Items: The frequency of the i th item in the data stream

is computed by by multiplying the sketch component s j by r i .

However, this estimation is accurate only for the case of frequent

items, since the error is estimation is proportional to the overall

frequency of the items in the data stream.

More details of computations which one can perform with the AMS

sketch are discussed in [17, 18].

The second kind of sketch which is used for counting is the count-min

sketch [31]. The count-min sketch is based upon the concept of hashing,

and uses k =ln(1 /δ ) pairwise-independent hash functions, which hash

onto integers in the range (0 ...e/ ). For each incoming item, the k hash

functions are applied and the frequency count is incremented by 1. In

the event that the incoming item has frequency f , then the correspond-

ing frequency count is incremented by f . Note that by hashing an item

into the k cells, we are ensuring that we maintain an overestimate on the

corresponding frequency. It can be shown that the minimum of these

cells provides the -accurate estimate to the frequency with probability

at least 1

δ . It has been shown in [31] that the method can also be

naturally extended to other problems such as finding the dot product

−

Search WWH ::

Custom Search

Home