Clustering - Mining of Massive Datasets

Databases Reference

In-Depth Information

each of these candidate clustroids using the formulas given in Section 7.5.4. We

also follow the strategy given in that section to pick a subset of the distant

points from each cluster to be the set of distant points for the merged cluster,

and to compute the new rowsum and distance-to-clustroid for each.

2

7.6.5

Answering Queries

Recall that we assume a query is a request for the clusters of the most recent m

points in the stream, where m≤N . Because of the strategy we have adopted

of combining buckets as we go back in time, we may not be able to find a set

of buckets that covers exactly the last m points. However, if we choose the

smallest set of buckets that cover the last m points, we shall include in these

buckets no more than the last 2m points. We shall produce, as answer to the

query, the centroids or clustroids of all the points in the selected buckets. In

order for the result to be a good approximation to the clusters for exactly the

last m points, we must assume that the points between 2m and m + 1 will not

have radically different statistics from the most recent m points. However, if

the statistics vary too rapidly, recall from Section 4.6.6 that a more complex

bucketing scheme can guarantee that we can find buckets to cover at most the

last m(1 + ǫ) points, for any ǫ > 0.

Having selected the desired buckets, we pool all their clusters. We then use

some methodology for deciding which clusters to merge. Examples 7.14 and

7.16 are illustrations of two approaches to this merger. For instance, if we are

required to produce exactly k clusters, then we can merge the clusters with the

closest centroids until we are left with only k clusters, as in Example 7.14. Or

we can make a decision whether or not to merge clusters in various ways, as we

sampled in Example 7.16.

7.6.6

Clustering in a Parallel Environment

Now, let us briefly consider the use of parallelism available in a computing

cluster. 3 We assume we are given a very large collection of points, and we wish

to exploit parallelism to compute the centroids of their clusters. The simplest

approach is to use a map-reduce strategy, but in most cases we are constrained

to use a single Reduce task.

Begin by creating many Map tasks. Each task is assigned a subset of the

points. The Map function's job is to cluster the points it is given. Its output is

a set of key-value pairs with a fixed key 1, and a value that is the description

of one cluster. This description can be any of the possibilities suggested in

Section 7.6.2, such as the centroid, count, and diameter of the cluster.

Since all key-value pairs have the same key, there can be only one Reduce

task. This task gets descriptions of the clusters produced by each of the Map

3 Do not forget that the term “cluster” has two completely different meanings in this

section.

Mining of Massive Datasets

Search WWH ::

Custom Search

Home