Sampling methods are widely used in traditional database applications, and are extremely popular because of their broad applicability across a wide array of tasks in data streams. A further advantage of sampling methods is that, unlike many other synopsis construction methods, they preserve the inter-attribute correlations of the data in the sample. It is also often possible to use probabilistic inequalities in order to bound the error of a variety of applications that use sampling methods.
However, a key problem in extending sampling methods to the data stream scenario is that one does not know the total number of data points to be sampled in advance. Rather, one must maintain the sample dynamically over the entire course of the computation. A method called reservoir sampling, which maintains such a sample dynamically, was first proposed in [72]. The technique was originally proposed in the context of one-pass access of data on magnetic storage devices, but it extends naturally to the data stream scenario.
Let us consider the case where we wish to maintain an unbiased sample of size n from the data stream. In order to initialize the approach, we simply add the first n points from the stream to the reservoir. Subsequently, when the (t+1)-th point is received, it is added to the reservoir with probability n/(t+1). When the data point is added to the reservoir, it replaces a point chosen uniformly at random from the reservoir. It can be shown that this simple approach maintains a uniform random sample of the points seen so far.
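As an illustration, the following is a minimal sketch of this reservoir maintenance scheme in Python. The class name and interface are our own; only the insertion and replacement rule follows the description above.

```python
import random

class ReservoirSample:
    """Maintains a uniform random sample of size n over a data stream."""

    def __init__(self, n):
        self.n = n           # desired sample size
        self.t = 0           # number of points seen so far
        self.reservoir = []  # the current sample

    def add(self, point):
        self.t += 1
        if len(self.reservoir) < self.n:
            # The first n points are always admitted.
            self.reservoir.append(point)
        elif random.random() < self.n / self.t:
            # The t-th point is admitted with probability n/t and
            # replaces a uniformly chosen resident of the reservoir.
            self.reservoir[random.randrange(self.n)] = point
```

Feeding every stream point through add() keeps the reservoir a uniform sample of all points seen so far, using O(n) memory regardless of the stream length.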
We note that the uniform sampling approach may not be very effective in cases where the data stream evolves significantly over time. In such cases, one may either choose to maintain the stream sample over a sliding window of recent points, or use a decay-based approach in order to bias the sample toward recent data. An approach for sliding window computation over data streams is discussed in [20].
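The algorithm of [20] is not reproduced here. As an illustrative alternative, the sketch below uses the standard priority sampling idea to maintain a uniform sample of size one over a sequence-based sliding window: every arriving point receives an independent random priority, and the sample is the in-window point with the highest priority. The names and window semantics are our own assumptions.

```python
import random
from collections import deque

class SlidingWindowSample:
    """Uniform sample of size 1 over the last w points of a stream."""

    def __init__(self, w):
        self.w = w
        self.t = 0
        # Candidates (arrival index, priority, point), priorities decreasing.
        self.candidates = deque()

    def add(self, point):
        self.t += 1
        p = random.random()
        # A point with a lower priority than the newcomer can never again
        # be the window maximum, so it is discarded.
        while self.candidates and self.candidates[-1][1] < p:
            self.candidates.pop()
        self.candidates.append((self.t, p, point))
        # Expire the front candidate once it falls out of the window.
        if self.candidates[0][0] <= self.t - self.w:
            self.candidates.popleft()

    def sample(self):
        # The highest-priority point in the window is uniformly
        # distributed over the window contents.
        return self.candidates[0][2] if self.candidates else None
```

Only the suffix maxima of the priority sequence need to be retained, so the expected space requirement is logarithmic in the window size w.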
A second approach [6] uses biased decay functions in order to construct synopses from data streams. It has been shown in [6] that the problem is extremely difficult for arbitrary decay functions; in such cases, there is no known solution to the problem. However, it is possible to design very simple algorithms for some important classes of decay functions. One such class is that of exponential decay functions, in which a point arriving at time r carries weight f(r, t) = e^(-λ(t-r)) at time t. The exponential decay function is extremely important because of its memoryless property, which guarantees that the future treatment of a data point is independent of the past data points which have arrived: the relative weights of any two points depend only on the gap between their arrival times. An interesting result is that by making simple modifications to the algorithm of [72], in terms of the probabilities of insertion and deletion, it is possible to construct a robust algorithm for this problem. It has been shown in [6] that the approach is quite effective in practice.
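As a hedged sketch, one commonly described realization of this idea for the exponential decay function is the following: each arriving point is always inserted, and it triggers the deletion of a random resident with probability equal to the current fill fraction of a reservoir of capacity 1/λ. The parameter name lam and the class interface are our own; consult [6] for the exact insertion and deletion probabilities.

```python
import random

class BiasedReservoirSample:
    """Exponentially biased reservoir: recent points are favored.

    A sketch assuming a reservoir of capacity 1/lam, where lam is the
    exponential decay rate of the bias function.
    """

    def __init__(self, lam):
        self.capacity = max(1, int(1.0 / lam))  # maximum reservoir size
        self.reservoir = []

    def add(self, point):
        q = len(self.reservoir)
        if random.random() < q / self.capacity:
            # Replacement: the new point ejects a uniformly chosen resident.
            self.reservoir[random.randrange(q)] = point
        else:
            # Growth: the new point is simply appended.
            self.reservoir.append(point)
```

Once the reservoir is full, each new arrival ejects a uniformly chosen resident, so a given point survives each subsequent arrival with probability 1 - 1/capacity; its retention probability therefore decays geometrically as the stream progresses, mirroring the exponential bias.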