Database Reference
In-Depth Information
than k , and, when we merge, only merge clusters when the result is sufficiently coherent
according to one of the criteria outlined in Section 7.2.3 . Or, we could use a hierarchical
strategy, and make the best merges, so as to maintain p > k clusters in each bucket.
Suppose, to be specific, that we want to put a limit on the sum of the distances between
all the points of a cluster and its centroid. Then in addition to the count of points and the
centroid of a cluster, we can include an estimate of this sum in the record for a cluster.
When we initialize a bucket, we can compute the sum exactly. But as we merge clusters,
this parameter becomes an estimate only. Suppose we merge two clusters, and want to com-
pute the sum of distances for the merged cluster. Use the notation for centroids and counts
in Example 7.14 , and in addition, let s 1 and s 2 be the sums for the two clusters. Then we
may estimate the radius of the merged cluster to be
n 1 | c 1 c | + n 2 | c 2 c | + s 1 + s 2
That is, we estimate the distance between any point x and the new centroid c to be the dis-
tance of that point to its old centroid (these distances sum to s 1 + s 2 , the last two terms in the
above expression) plus the distance from the old centroid to the new (these distances sum
to the first two terms of the above expression). Note that this estimate is an upper bound,
by the triangle inequality.
An alternative is to replace the sum of distances by the sum of the squares of the dis-
tances from the points to the centroid. If these sums for the two clusters are t 1 and t 2 , re-
spectively, then we can produce an estimate for the same sum in the new cluster as
n 1 | c 1 c | 2 + n 2 | c 2 c | 2 + t 1 + t 2
This estimate is close to correct if the space is high-dimensional, by the “curse of dimen-
sionality.”
EXAMPLE 7.16 Our third example will assume a non-Euclidean space and no constraint on
the number of clusters. We shall borrow several of the techniques from the GRGPF Al-
gorithm of Section 7.5 . Specifically, we represent clusters by their clustroid and rowsum
(sum of the squares of the distances from each node of the cluster to its clustroid). We in-
clude in the record for a cluster information about a set of points at maximum distance from
the clustroid, including their distances from the clustroid and their rowsums. Recall that
their purpose is to suggest a clustroid when this cluster is merged with another.
When we merge buckets, we may choose one of many ways to decide which clusters
to merge. For example, we may consider pairs of clusters in order of the distance between
their clustroids. We may also choose to merge clusters when we consider them, provided
the sum of their rowsums is below a certain limit. Alternatively, we may perform the merge
if the sum of rowsums divided by the number of points in the clusters is below a limit.
Search WWH ::




Custom Search