Database Reference
In-Depth Information
Any of the other strategies discussed for deciding when to merge clusters may be used as
well, provided we arrange to maintain the data (e.g., cluster diameter) necessary to make
decisions.
We then must pick a new clustroid, from among the points most distant from the clus-
troids of the two merged clusters. We can compute rowsums for each of these candidate
clustroids using the formulas given in Section 7.5.4 . We also follow the strategy given in
that section to pick a subset of the distant points from each cluster to be the set of distant
points for the merged cluster, and to compute the new rowsum and distance-to-clustroid for
each.
7.6.5
Answering Queries
Recall that we assume a query is a request for the clusters of the most recent m points in
the stream, where m N . Because of the strategy we have adopted of combining buckets as
we go back in time, we may not be able to find a set of buckets that covers exactly the last
m points. However, if we choose the smallest set of buckets that cover the last m points, we
shall include in these buckets no more than the last 2 m points. We shall produce, as answer
to the query, the centroids or clustroids of all the points in the selected buckets. In order for
the result to be a good approximation to the clusters for exactly the last m points, we must
assume that the points between 2 m and m + 1 will not have radically different statistics
from the most recent m points. However, if the statistics vary too rapidly, recall from Sec-
tion 4.6.6 that a more complex bucketing scheme can guarantee that we can find buckets to
cover at most the last m (1 + ϵ ) points, for any ϵ > 0.
Having selected the desired buckets, we pool all their clusters. We then use some meth-
odology for deciding which clusters to merge. Examples 7.14 and 7.16 are illustrations
of two approaches to this merger. For instance, if we are required to produce exactly k
clusters, then we can merge the clusters with the closest centroids until we are left with
only k clusters, as in Example 7.14 . Or we can make a decision whether or not to merge
clusters in various ways, as we sampled in Example 7.16 .
7.6.6
Clustering in a Parallel Environment
Now, let us briefly consider the use of parallelism available in a computing cluster. 3 We
assume we are given a very large collection of points, and we wish to exploit parallelism
to compute the centroids of their clusters. The simplest approach is to use a MapReduce
strategy, but in most cases we are constrained to use a single Reduce task.
Search WWH ::




Custom Search