Database Reference
In-Depth Information
(3) Stop when there is evidence that the next pair of clusters to be combined yields a bad
cluster. For example, we could track the average diameter of all the current clusters.
As long as we are combining points that truly belong in a cluster, this average will rise
gradually. However, if we combine two clusters that really don't deserve to be com-
bined, then the average diameter will take a sudden jump.
EXAMPLE 7.5 Let us reconsider Fig. 7.2 . It has three natural clusters. We computed the dia-
meter of the largest - the five points at the right - in Example 7.4 ; it is 4.24. The diameter
of the 3-node cluster at the lower left is 3, the distance between (2,2) and (5,2). The diamet-
er of the 4-node cluster at the upper left is The average diameter, 3.62, was reached
starting from 0 after nine mergers, so the rise is evidently slow: about 0.4 per merger.
If we are forced to merge two of these natural clusters, the best we can do is merge the
two at the left. The diameter of this cluster is that is the distance between the two
points (2,2) and (7,10). Now, the average of the diameters is (9 . 43 + 4 . 24) / 2 = 6 . 84. This av-
erage has jumped almost as much in one step as in all nine previous steps. That comparison
indicates that the last merger was inadvisable, and we should roll it back and stop.
7.2.4
Hierarchical Clustering in Non-Euclidean Spaces
When the space is non-Euclidean, we need to use some distance measure that is computed
from points, such as Jaccard, cosine, or edit distance. That is, we cannot base distances on
“location” of points. The algorithm of Section 7.2.1 requires distances between points to
be computed, but presumably we have a way to compute those distances. A problem arises
when we need to represent a cluster, because we cannot replace a collection of points by
their centroid.
EXAMPLE 7.6 The problem arises for any of the non-Euclidean distances we have dis-
cussed, but to be concrete, suppose we are using edit distance, and we decide to merge the
strings abcd and aecdb . These have edit distance 3 and might well be merged. However,
there is no string that represents their average, or that could be thought of as lying naturally
between them. We could take one of the strings that we might pass through when trans-
forming one string to the other by single insertions or deletions, such as aebcd , but there
are many such options. Moreover, when clusters are formed from more than two strings,
the notion of “on the path between” stops making sense.
Given that we cannot combine points in a cluster when the space is non-Euclidean, our
only choice is to pick one of the points of the cluster itself to represent the cluster. Ideally,
this point is close to all the points of the cluster, so it in some sense lies in the “center.”
Search WWH ::




Custom Search