Database Reference
In-Depth Information
stability of FP and FN rates at large stream lengths. However, the FNR attained,
although constant, is extremely high.
13.4 RESERVOIR SAMPLING MECHANISM
Finding the number of distinct elements in a stream was explored in [18]. The prob-
lem of synopsis maintenance [5,22] has been studied in great detail for its exten-
sive application in query estimation [32]. Many synopsis methods such as sampling,
wavelets, histograms, and sketches have been designed for approximate query
answering. A comprehensive survey of stream synopsis methods can be found in [2].
An important class of synopsis construction methods is the reservoir sampling [38].
This sampling method has great appeal as it generates a sample of original multi-
dimensional data for various data mining applications.
In reservoir sampling, one maintains a reservoir of size n from the data stream.
After the first n points have been added to the reservoir, subsequent elements are
inserted into the reservoir with an insertion probability given by n / t for the t th
element of the stream. An interesting characteristic of this algorithm is that it is
extremely easy to implement and that all subsets of data are equi-probable to be
present in the reservoir. Each data point is also associated with a bias function rep-
resenting its probability to be inserted into the reservoir. Hence, the biasing function
captures the changing behavior of the stream.
Property
After t points in the data stream have been processed, the probability of any point in
the stream belonging to the sample of size n is equal to n / t .
One interesting characteristic of this maintenance algorithm is that it is extremely
efficient to implement in practice. When new points in the stream arrive, it needs to
be decided whether or not to insert into the current sample array that represents the
reservoir. The sample array can then be overwritten at a random position. The bias
function [3] associated with the r th data point at the time of arrival of the t th point
( r  ≤ t ) is given by f ( r , t ) and is related to the probability p ( r , t ) of the r th point belong-
ing to the reservoir at the time of arrival of the t th point. Specifically, p ( r , t ) is pro-
portional to f ( r , t ) . The function f ( r , t ) is monotonically decreasing with t (for fixed r )
and monotonically increasing with r (for fixed t ) . Therefore, the use of a bias function
ensures that recent points have higher probability of being represented in the sample
reservoir. Hence, defining the concept of a bias-sensitive sample S ( t ), which in turn
is defined by the bias function f ( r , t ) as
Definition Let f ( r , t ) be the bias function for the r th point at the arrival of the t th
point. A biased sample S ( t ) at the time of arrival of the t th point in the stream is
defined as a sample such that the relative probability p ( r, t ) of the r th point belonging
to the sample S ( t ) (of size n ) is proportional to f ( r , t ) .
Search WWH ::




Custom Search