Databases Reference
In-Depth Information
each of these candidate clustroids using the formulas given in Section 7.5.4. We
also follow the strategy given in that section to pick a subset of the distant
points from each cluster to be the set of distant points for the merged cluster,
and to compute the new rowsum and distance-to-clustroid for each.
2
7.6.5
Answering Queries
Recall that we assume a query is a request for the clusters of the most recent m
points in the stream, where m≤N . Because of the strategy we have adopted
of combining buckets as we go back in time, we may not be able to find a set
of buckets that covers exactly the last m points. However, if we choose the
smallest set of buckets that cover the last m points, we shall include in these
buckets no more than the last 2m points. We shall produce, as answer to the
query, the centroids or clustroids of all the points in the selected buckets. In
order for the result to be a good approximation to the clusters for exactly the
last m points, we must assume that the points between 2m and m + 1 will not
have radically different statistics from the most recent m points. However, if
the statistics vary too rapidly, recall from Section 4.6.6 that a more complex
bucketing scheme can guarantee that we can find buckets to cover at most the
last m(1 + ǫ) points, for any ǫ > 0.
Having selected the desired buckets, we pool all their clusters. We then use
some methodology for deciding which clusters to merge. Examples 7.14 and
7.16 are illustrations of two approaches to this merger. For instance, if we are
required to produce exactly k clusters, then we can merge the clusters with the
closest centroids until we are left with only k clusters, as in Example 7.14. Or
we can make a decision whether or not to merge clusters in various ways, as we
sampled in Example 7.16.
7.6.6
Clustering in a Parallel Environment
Now, let us briefly consider the use of parallelism available in a computing
cluster. 3 We assume we are given a very large collection of points, and we wish
to exploit parallelism to compute the centroids of their clusters. The simplest
approach is to use a map-reduce strategy, but in most cases we are constrained
to use a single Reduce task.
Begin by creating many Map tasks. Each task is assigned a subset of the
points. The Map function's job is to cluster the points it is given. Its output is
a set of key-value pairs with a fixed key 1, and a value that is the description
of one cluster. This description can be any of the possibilities suggested in
Section 7.6.2, such as the centroid, count, and diameter of the cluster.
Since all key-value pairs have the same key, there can be only one Reduce
task. This task gets descriptions of the clusters produced by each of the Map
3 Do not forget that the term “cluster” has two completely different meanings in this
section.
Search WWH ::




Custom Search