Clustering - Mining of Massive Datasets

Database Reference

In-Depth Information

than k , and, when we merge, only merge clusters when the result is sufficiently coherent

according to one of the criteria outlined in Section 7.2.3 . Or, we could use a hierarchical

strategy, and make the best merges, so as to maintain p > k clusters in each bucket.

Suppose, to be specific, that we want to put a limit on the sum of the distances between

all the points of a cluster and its centroid. Then in addition to the count of points and the

centroid of a cluster, we can include an estimate of this sum in the record for a cluster.

When we initialize a bucket, we can compute the sum exactly. But as we merge clusters,

this parameter becomes an estimate only. Suppose we merge two clusters, and want to com-

pute the sum of distances for the merged cluster. Use the notation for centroids and counts

in Example 7.14 , and in addition, let s 1 and s 2 be the sums for the two clusters. Then we

may estimate the radius of the merged cluster to be

n 1 | c 1 − c | + n 2 | c 2 − c | + s 1 + s 2

That is, we estimate the distance between any point x and the new centroid c to be the dis-

tance of that point to its old centroid (these distances sum to s 1 + s 2 , the last two terms in the

above expression) plus the distance from the old centroid to the new (these distances sum

to the first two terms of the above expression). Note that this estimate is an upper bound,

by the triangle inequality.

An alternative is to replace the sum of distances by the sum of the squares of the dis-

tances from the points to the centroid. If these sums for the two clusters are t 1 and t 2 , re-

spectively, then we can produce an estimate for the same sum in the new cluster as

n 1 | c 1 − c | 2 + n 2 | c 2 − c | 2 + t 1 + t 2

This estimate is close to correct if the space is high-dimensional, by the “curse of dimen-

sionality.”

□

EXAMPLE 7.16 Our third example will assume a non-Euclidean space and no constraint on

the number of clusters. We shall borrow several of the techniques from the GRGPF Al-

gorithm of Section 7.5 . Specifically, we represent clusters by their clustroid and rowsum

(sum of the squares of the distances from each node of the cluster to its clustroid). We in-

clude in the record for a cluster information about a set of points at maximum distance from

the clustroid, including their distances from the clustroid and their rowsums. Recall that

their purpose is to suggest a clustroid when this cluster is merged with another.

When we merge buckets, we may choose one of many ways to decide which clusters

to merge. For example, we may consider pairs of clusters in order of the distance between

their clustroids. We may also choose to merge clusters when we consider them, provided

the sum of their rowsums is below a certain limit. Alternatively, we may perform the merge

if the sum of rowsums divided by the number of points in the clusters is below a limit.

Search WWH ::

Custom Search

Home