Sampling methods are widely used in traditional database applications, and are extremely popular because of their broad applicability across a wide array of tasks in data streams. A further advantage of sampling methods is that, unlike many other synopsis construction methods, they preserve the inter-attribute correlations of the data in the sample. It is also often possible to use probabilistic inequalities in order to bound the error of a variety of applications that use sampling methods.
However, a key problem in extending sampling methods to the data stream scenario is that one does not know the total number of data points to be sampled in advance. Rather, one must maintain the sample dynamically over the entire course of the computation. A method called reservoir sampling, which maintains such a sample dynamically, was first proposed in [72]. The technique was originally proposed in the context of one-pass access of data on magnetic storage devices, but it extends naturally to the data stream scenario.
Let us consider the case where we wish to maintain an unbiased sample of size n from the data stream. In order to initialize the approach, we simply add the first n points from the stream to the reservoir. Subsequently, when the (t+1)-th point is received, it is added to the reservoir with probability n/(t+1). When the data point is added to the reservoir, it replaces a point chosen uniformly at random from the reservoir. It can be shown that this simple approach maintains a uniform random sample of the points seen so far.
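As an illustration, the following is a minimal sketch of this reservoir maintenance scheme in Python. The class name and interface are our own; only the insertion and replacement rule follows the description above.

```python
import random

class ReservoirSample:
    """Maintains a uniform random sample of size n over a data stream."""

    def __init__(self, n):
        self.n = n           # desired sample size
        self.t = 0           # number of points seen so far
        self.reservoir = []  # the current sample

    def add(self, point):
        self.t += 1
        if len(self.reservoir) < self.n:
            # The first n points are always admitted.
            self.reservoir.append(point)
        elif random.random() < self.n / self.t:
            # The t-th point is admitted with probability n/t and
            # replaces a uniformly chosen resident of the reservoir.
            self.reservoir[random.randrange(self.n)] = point
```

Feeding every stream point through add() keeps the reservoir a uniform sample of all points seen so far, using O(n) memory regardless of the stream length.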
We note that the uniform sampling approach may not be very effective in cases where the data stream evolves significantly over time. In such cases, one may either choose to maintain the stream sample over a sliding window of recent points, or use a decay-based approach in order to bias the sample toward recent data. An approach for sliding window computation over data streams is discussed in [20].
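The algorithm of [20] is not reproduced here. As an illustrative alternative, the sketch below uses the standard priority sampling idea to maintain a uniform sample of size one over a sequence-based sliding window: every arriving point receives an independent random priority, and the sample is the in-window point with the highest priority. The names and window semantics are our own assumptions.

```python
import random
from collections import deque

class SlidingWindowSample:
    """Uniform sample of size 1 over the last w points of a stream."""

    def __init__(self, w):
        self.w = w
        self.t = 0
        # Candidates (arrival index, priority, point), priorities decreasing.
        self.candidates = deque()

    def add(self, point):
        self.t += 1
        p = random.random()
        # A point with a lower priority than the newcomer can never again
        # be the window maximum, so it is discarded.
        while self.candidates and self.candidates[-1][1] < p:
            self.candidates.pop()
        self.candidates.append((self.t, p, point))
        # Expire the front candidate once it falls out of the window.
        if self.candidates[0][0] <= self.t - self.w:
            self.candidates.popleft()

    def sample(self):
        # The highest-priority point in the window is uniformly
        # distributed over the window contents.
        return self.candidates[0][2] if self.candidates else None
```

Only the suffix maxima of the priority sequence need to be retained, so the expected space requirement is logarithmic in the window size w.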
A second approach [6] uses biased decay functions in order to construct synopses from data streams. It has been shown in [6] that the problem is extremely difficult for arbitrary decay functions; in such cases, there is no known solution to the problem. However, it is possible to design very simple algorithms for some important classes of decay functions. One such class is that of exponential decay functions, in which a point arriving at time r carries weight f(r, t) = e^(-λ(t-r)) at time t. The exponential decay function is extremely important because of its memoryless property, which guarantees that the future treatment of a data point is independent of the past data points which have arrived: the relative weights of any two points depend only on the gap between their arrival times. An interesting result is that by making simple modifications to the algorithm of [72], in terms of the probabilities of insertion and deletion, it is possible to construct a robust algorithm for this problem. It has been shown in [6] that the approach is quite effective in practice.
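As a hedged sketch, one commonly described realization of this idea for the exponential decay function is the following: each arriving point is always inserted, and it triggers the deletion of a random resident with probability equal to the current fill fraction of a reservoir of capacity 1/λ. The parameter name lam and the class interface are our own; consult [6] for the exact insertion and deletion probabilities.

```python
import random

class BiasedReservoirSample:
    """Exponentially biased reservoir: recent points are favored.

    A sketch assuming a reservoir of capacity 1/lam, where lam is the
    exponential decay rate of the bias function.
    """

    def __init__(self, lam):
        self.capacity = max(1, int(1.0 / lam))  # maximum reservoir size
        self.reservoir = []

    def add(self, point):
        q = len(self.reservoir)
        if random.random() < q / self.capacity:
            # Replacement: the new point ejects a uniformly chosen resident.
            self.reservoir[random.randrange(q)] = point
        else:
            # Growth: the new point is simply appended.
            self.reservoir.append(point)
```

Once the reservoir is full, each new arrival ejects a uniformly chosen resident, so a given point survives each subsequent arrival with probability 1 - 1/capacity; its retention probability therefore decays geometrically as the stream progresses, mirroring the exponential bias.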