is that the distribution of data points within a bucket is not retained,
and is therefore assumed to be uniform. This causes inaccuracy because
of extrapolation at the query boundaries. A natural choice is to use
an equal number of counts in each bucket. This minimizes the error
variation across different buckets. However, in the case of data streams,
the boundaries to be used for equi-depth histogram construction are not
known a priori. We further note that the design of equi-depth buckets
is exactly the problem of quantile estimation, since the equi-depth
partitions define the quantiles in the data. Another choice of histogram
construction is to minimize the variance of the frequencies of the
different values within each bucket. This ensures that the uniform
distribution assumption holds approximately when extrapolating the
frequencies of the buckets at the two ends of a query. Such histograms
are referred
to as V-optimal histograms. Algorithms for V-optimal histogram con-
struction are proposed in [51, 52]. A more detailed discussion of several
algorithms for histogram construction may be found in [4].
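
As a rough illustration of the points above, the following Python sketch
builds equi-depth bucket boundaries from the empirical quantiles of a data
set and then answers a range-count query under the uniformity assumption.
It is an offline sketch that sorts all values in memory, so it sidesteps
the streaming difficulty noted above. The function names
(equi_depth_histogram, estimate_range_count) are hypothetical and are not
taken from the cited work.

```python
from bisect import bisect_left


def equi_depth_histogram(values, num_buckets):
    """Build (boundaries, counts) with roughly equal counts per bucket.

    Offline sketch: sorts all values, which a true streaming method cannot do.
    """
    data = sorted(values)
    n = len(data)
    if n == 0 or num_buckets < 1:
        raise ValueError("need at least one value and one bucket")
    # Interior boundaries are the i/num_buckets quantiles of the data.
    boundaries = [data[0]]
    boundaries += [data[(i * n) // num_buckets] for i in range(1, num_buckets)]
    boundaries.append(data[-1])
    # Bucket i covers [boundaries[i], boundaries[i + 1]); the last is closed.
    starts = [bisect_left(data, b) for b in boundaries[:-1]] + [n]
    counts = [starts[i + 1] - starts[i] for i in range(num_buckets)]
    return boundaries, counts


def estimate_range_count(boundaries, counts, lo, hi):
    """Estimate how many values fall in [lo, hi], assuming uniform buckets."""
    total = 0.0
    for i, count in enumerate(counts):
        left, right = boundaries[i], boundaries[i + 1]
        width = right - left
        if width <= 0:
            # Degenerate bucket (all values equal): count it if it lies in range.
            if lo <= left <= hi:
                total += count
            continue
        # Partially covered buckets contribute in proportion to the overlap.
        overlap = min(hi, right) - max(lo, left)
        if overlap > 0:
            total += count * overlap / width
    return total


boundaries, counts = equi_depth_histogram(range(100), 4)
estimate = estimate_range_count(boundaries, counts, 10, 30)
```

Only the buckets cut by the query endpoints are extrapolated; fully covered
buckets contribute their exact counts, which is why the error concentrates
at the query boundaries.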
3.6 Dimensionality Reduction and Forecasting in Data Streams
Because of the inherent temporal nature of data streams, the problems
of dimensionality reduction and forecasting are particularly important.
When there are a large number of simultaneous data streams, we can use
the correlations between different data streams in order to make
effective predictions [70, 75] of the future behavior of the data streams. In
particular, the well-known MUSCLES method [75] is useful in applying
regression analysis to data streams. The regression analysis is helpful
in predicting the future behavior of the data stream. A related tech-
nique is the SPIRIT algorithm, which explores the relationship between
dimensionality reduction and forecasting in data streams. The primary
idea is that a compact number of hidden variables can be used to com-
prehensively describe the data stream. This compact representation can
also be used for effective forecasting of the data streams. A discussion
of different dimensionality reduction and forecasting methods (including
SPIRIT) is provided in [4].
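
To make the regression idea concrete, here is a minimal Python sketch of
online least-squares regression that predicts one stream from a single
correlated stream using running sums. It is a deliberately simplified
stand-in, not the MUSCLES method of [75] and not SPIRIT; the class name
OnlineLinearRegression and the single-predictor setup are assumptions made
purely for illustration.

```python
class OnlineLinearRegression:
    """Incrementally fits y ~ slope * x + intercept from (x, y) pairs.

    Hypothetical helper: only running sums are stored, so memory use is
    constant regardless of how many stream elements have been seen.
    """

    def __init__(self):
        self.n = 0
        self.sum_x = 0.0
        self.sum_y = 0.0
        self.sum_xx = 0.0
        self.sum_xy = 0.0

    def update(self, x, y):
        """Absorb one simultaneous observation from the two streams."""
        self.n += 1
        self.sum_x += x
        self.sum_y += y
        self.sum_xx += x * x
        self.sum_xy += x * y

    def coefficients(self):
        """Return the least-squares (slope, intercept) for the data so far."""
        denom = self.n * self.sum_xx - self.sum_x ** 2
        if self.n < 2 or denom == 0:
            raise ValueError("need at least two distinct x values")
        slope = (self.n * self.sum_xy - self.sum_x * self.sum_y) / denom
        intercept = (self.sum_y - slope * self.sum_x) / self.n
        return slope, intercept

    def predict(self, x):
        """Predict the dependent stream's value for a predictor value x."""
        slope, intercept = self.coefficients()
        return slope * x + intercept


model = OnlineLinearRegression()
for x in range(10):
    model.update(x, 2 * x + 1)  # synthetic, perfectly correlated streams
prediction = model.predict(10)
```

The same running-sum idea extends to several predictor streams by
maintaining the corresponding sums of products, at the cost of solving a
small linear system for each prediction.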
3.7 Distributed Mining of Data Streams
In many instances, streams are generated at multiple distributed com-
puting nodes. An example of such a case would be sensor networks in
which the streams are generated at different sensor nodes. Analyzing and
monitoring data in such environments requires data mining technology
that optimizes a variety of criteria such as communication