MINING SENSOR DATA STREAMS - Managing and Mining Sensor Data

Database Reference

In-Depth Information

effective in practice, especially when there is significant evolution of the

underlying data stream.

While sampling has several advantages in terms of simplicity and

preservation of multi-dimensional correlations, it loses its effectiveness in

the presence of data sparsity. For example, a query which contains very

few data points is unlikely to be accurate with the use of a sampling ap-

proach. However, this is a general problem with most techniques which

are effective at counting frequent elements, but are not quite as effective

at counting rare or distinct elements in the data stream.

Sketches: Sketches use some properties of random sampling in order

to perform counting tasks in data streams. Sketches are most useful

when the domain size of a data stream is very large. In such cases,

the number of possible distinct elements become very large, and it is no

longer possible to track them in space-constrained scenarios. There are

two broad classes of sketches: projection based and hash based . We will

discuss each of them in turn.

Projection based sketches are constructed on the broad idea of ran-

dom projection [54]. The most well known projection-based sketch is

the AMS sketch [17, 18], which we will discuss below. It has been shown

in [54], that by by randomly sampling subspaces from multi-dimensional

data, it is possible to compute -accurate projections of the data with

high probability. This broad idea can easily be extended to the mas-

sive domain case, by viewing each distinct item as a dimension, and the

counts on these items as the corresponding values. The main problem is

that the vector for performing the projection cannot be maintained ex-

plicitly since the length of such a vector would be of the same size as the

number of distinct elements. In fact, since the sketch-based method is

most relevant in the distinct element scenario, such an approach defeats

the purpose of keeping a synopsis structure in the first place.

Let us assume that the random projection is performed using k sketch

vectors, and r i represents the j th vector for the i th item in the domain

being tracked. In order to achieve the goal of ecient synopsis construc-

tion, we store the random vectors implicitly in the form of a seed, and

this can be used to dynamically generate the vector. The main idea

discussed in [49] is that it is possible to generate random vectors with

a seed of size O (log( N )), provided that one is willing to work with the

restriction that r i ∈{−

should be 4-wise independent. The sketch

is computed by adding r i to the j th component of the sketch for the i th

item. In the event that the incoming item has frequency f ,weaddthe

value f

1 , +1

}

r i . Let us assume that there are a total of k sketch components

which are denoted by ( s 1 ...s k ). Some key properties of the pseudo-

·

Managing and Mining Sensor Data

Search WWH ::

Custom Search

Home