Databases Reference
In-Depth Information
Example 7.5 : Let us reconsider Fig. 7.2. It has three natural clusters. We
computed the diameter of the largest - the five points at the right - in Ex-
ample 7.4; it is 4.24. The diameter of the 3-node cluster at the lower left is
3, the distance between (2,2) and (5,2). The diameter of the 4-node cluster at
the upper left is
13 = 3.61. The average diameter, 3.62, was reached starting
from 0 after nine mergers, so the rise is evidently slow: about 0.4 per merger.
If we are forced to merge two of these natural clusters, the best we can do
is merge the two at the left. The diameter of this cluster is
89 = 9.43; that is
the distance between the two points (2,2) and (7,10). Now, the average of the
diameters is (9.43 + 4.24)/2 = 6.84. This average has jumped almost as much
in one step as in all nine previous steps. That comparison indicates that the
last merger was inadvisable, and we should roll it back and stop.
2
7.2.4 Hierarchical Clustering in Non-Euclidean Spaces
When the space is non-Euclidean, we need to use some distance measure that
is computed from points, such as Jaccard, cosine, or edit distance. That is, we
cannot base distances on “location” of points. The algorithm of Section 7.2.1
requires distances between points to be computed, but presumably we have a
way to compute those distances. A problem arises when we need to represent
a cluster, because we cannot replace a collection of points by their centroid.
Example 7.6 : The problem arises for any of the non-Euclidean distances we
have discussed, but to be concrete, suppose we are using edit distance, and we
decide to merge the strings abcd and aecdb . These have edit distance 3 and
might well be merged. However, there is no string that represents their average,
or that could be thought of as lying naturally between them. We could take
one of the strings that we might pass through when transforming one string to
the other by single insertions or deletions, such as aebcd , but there are many
such options. Moreover, when clusters are formed from more than two strings,
the notion of “on the path between” stops making sense.
2
Given that we cannot combine points in a cluster when the space is non-
Euclidean, our only choice is to pick one of the points of the cluster itself to
represent the cluster. Ideally, this point is close to all the points of the cluster,
so it in some sense lies in the “center.” We call the representative point the
clustroid. We can select the clustroid in various ways, each designed to, in some
sense, minimize the distances between the clustroid and the other points in
the cluster. Common choices include selecting as the clustroid the point that
minimizes:
1. The sum of the distances to the other points in the cluster.
2. The maximum distance to another point in the cluster.
3. The sum of the squares of the distances to the other points in the cluster.
Search WWH ::




Custom Search