Clustering - Mining of Massive Datasets

Database Reference

In-Depth Information

Any of the other strategies discussed for deciding when to merge clusters may be used as

well, provided we arrange to maintain the data (e.g., cluster diameter) necessary to make

decisions.

We then must pick a new clustroid, from among the points most distant from the clus-

troids of the two merged clusters. We can compute rowsums for each of these candidate

clustroids using the formulas given in Section 7.5.4 . We also follow the strategy given in

that section to pick a subset of the distant points from each cluster to be the set of distant

points for the merged cluster, and to compute the new rowsum and distance-to-clustroid for

each.

□

7.6.5

Answering Queries

Recall that we assume a query is a request for the clusters of the most recent m points in

the stream, where m ≤ N . Because of the strategy we have adopted of combining buckets as

we go back in time, we may not be able to find a set of buckets that covers exactly the last

m points. However, if we choose the smallest set of buckets that cover the last m points, we

shall include in these buckets no more than the last 2 m points. We shall produce, as answer

to the query, the centroids or clustroids of all the points in the selected buckets. In order for

the result to be a good approximation to the clusters for exactly the last m points, we must

assume that the points between 2 m and m + 1 will not have radically different statistics

from the most recent m points. However, if the statistics vary too rapidly, recall from Sec-

tion 4.6.6 that a more complex bucketing scheme can guarantee that we can find buckets to

cover at most the last m (1 + ϵ ) points, for any ϵ > 0.

Having selected the desired buckets, we pool all their clusters. We then use some meth-

odology for deciding which clusters to merge. Examples 7.14 and 7.16 are illustrations

of two approaches to this merger. For instance, if we are required to produce exactly k

clusters, then we can merge the clusters with the closest centroids until we are left with

only k clusters, as in Example 7.14 . Or we can make a decision whether or not to merge

clusters in various ways, as we sampled in Example 7.16 .

7.6.6

Clustering in a Parallel Environment

Now, let us briefly consider the use of parallelism available in a computing cluster. 3 We

assume we are given a very large collection of points, and we wish to exploit parallelism

to compute the centroids of their clusters. The simplest approach is to use a MapReduce

strategy, but in most cases we are constrained to use a single Reduce task.

Search WWH ::

Custom Search

Home