Clustering - Mining of Massive Datasets

Database Reference

In-Depth Information

7.6.1

The Stream-Computing Model

We assume that each stream element is a point in some space. The sliding window consists

of the most recent N points. Our goal is to precluster subsets of the points in the stream, so

that we may quickly answer queries of the form “what are the clusters of the last m points?”

for any m ≤ N . There are many variants of this query, depending on what we assume about

what constitutes a cluster. For instance, we may use a k -means approach, where we are

really asking that the last m points be partitioned into exactly k clusters. Or, we may allow

the number of clusters to vary, but use one of the criteria in Section 7.2.3 or 7.2.4 to de-

termine when to stop merging clusters into larger clusters.

We make no restriction regarding the space in which the points of the stream live. It may

be a Euclidean space, in which case the answer to the query is the centroids of the selected

clusters. The space may be non-Euclidean, in which case the answer is the clustroids of

the selected clusters, where any of the definitions for “clustroid” may be used (see Section

7.2.4 ) .

The problem is considerably easier if we assume that all stream elements are chosen with

statistics that do not vary along the stream. Then, a sample of the stream is good enough

to estimate the clusters, and we can in effect ignore the stream after a while. However, the

stream model normally assumes that the statistics of the stream elements varies with time.

For example, the centroids of the clusters may migrate slowly as time goes on, or clusters

may expand, contract, divide, or merge.

7.6.2

A Stream-Clustering Algorithm

In this section, we shall present a greatly simplified version of an algorithm referred to as

BDMO (for the authors, B. Babcock, M. Datar, R. Motwani, and L. O'Callaghan). The true

version of the algorithm involves much more complex structures, which are designed to

provide performance guarantees in the worst case.

The BDMO Algorithm builds on the methodology for counting ones in a stream that was

described in Section 4.6 . Here are the key similarities and differences:

• Like that algorithm, the points of the stream are partitioned into, and summarized

by, buckets whose sizes are a power of two. Here, the size of a bucket is the number

of points it represents, rather than the number of stream elements that are 1.

• As before, the sizes of buckets obey the restriction that there are one or two of each

size, up to some limit. However, we do not assume that the sequence of allowable

bucket sizes starts with 1. Rather, they are required only to form a sequence where

each size is twice the previous size, e.g., 3 , 6 , 12 , 24 , . . . .

Search WWH ::

Custom Search

Home