Database Reference
In-Depth Information
7.6.1
The Stream-Computing Model
We assume that each stream element is a point in some space. The sliding window consists
of the most recent N points. Our goal is to precluster subsets of the points in the stream, so
that we may quickly answer queries of the form “what are the clusters of the last m points?”
for any m N . There are many variants of this query, depending on what we assume about
what constitutes a cluster. For instance, we may use a k -means approach, where we are
really asking that the last m points be partitioned into exactly k clusters. Or, we may allow
the number of clusters to vary, but use one of the criteria in Section 7.2.3 or 7.2.4 to de-
termine when to stop merging clusters into larger clusters.
We make no restriction regarding the space in which the points of the stream live. It may
be a Euclidean space, in which case the answer to the query is the centroids of the selected
clusters. The space may be non-Euclidean, in which case the answer is the clustroids of
the selected clusters, where any of the definitions for “clustroid” may be used (see Section
7.2.4 ) .
The problem is considerably easier if we assume that all stream elements are chosen with
statistics that do not vary along the stream. Then, a sample of the stream is good enough
to estimate the clusters, and we can in effect ignore the stream after a while. However, the
stream model normally assumes that the statistics of the stream elements varies with time.
For example, the centroids of the clusters may migrate slowly as time goes on, or clusters
may expand, contract, divide, or merge.
7.6.2
A Stream-Clustering Algorithm
In this section, we shall present a greatly simplified version of an algorithm referred to as
BDMO (for the authors, B. Babcock, M. Datar, R. Motwani, and L. O'Callaghan). The true
version of the algorithm involves much more complex structures, which are designed to
provide performance guarantees in the worst case.
The BDMO Algorithm builds on the methodology for counting ones in a stream that was
described in Section 4.6 . Here are the key similarities and differences:
• Like that algorithm, the points of the stream are partitioned into, and summarized
by, buckets whose sizes are a power of two. Here, the size of a bucket is the number
of points it represents, rather than the number of stream elements that are 1.
• As before, the sizes of buckets obey the restriction that there are one or two of each
size, up to some limit. However, we do not assume that the sequence of allowable
bucket sizes starts with 1. Rather, they are required only to form a sequence where
each size is twice the previous size, e.g., 3 , 6 , 12 , 24 , . . . .
Search WWH ::




Custom Search