Databases Reference
In-Depth Information
3. The radius of a cluster is the maximum distance between all the points
and the centroid. Combine the two clusters whose resulting cluster has
the lowest radius. A slight modification is to combine the clusters whose
result has the lowest average distance between a point and the centroid.
Another modification is to use the sum of the squares of the distances
between the points and the centroid. In some algorithms, we shall find
these variant definitions of “radius” referred to as “the radius.”
4. The diameter of a cluster is the maximum distance between any two
points of the cluster. Note that the radius and diameter of a cluster
are not related directly, as they are in a circle, but there is a tendency
for them to be proportional. We may choose to merge those clusters
whose resulting cluster has the smallest diameter. Variants of this rule,
analogous to the rule for radius, are possible.
Example 7.4 : Let us consider the cluster consisting of the five points at the
right of Fig. 7.2. The centroid of these five points is (10.8, 4.2). There is a tie
for the two furthest points from the centroid: (9,3) and (12,6), both at distance
4.68 = 2.16. Thus, the radius is 2.16. For the diameter, we find the two
points in the cluster having the greatest distance.
These are again (9,3) and
18 = 4.24, so that is the diameter. Notice that the
diameter is not exactly twice the radius, although it is close in this case. The
reason is that the centroid is not on the line between (9,3) and (12,6).
(12,6). Their distance is
2
We also have options in determining when to stop the merging process. We
already mentioned “stop when we have k clusters” for some predetermined k.
Here are some other options.
1. Stop if the diameter of the cluster that results from the best merger ex-
ceeds a threshold. We can also base this rule on the radius, or on any of
the variants of the radius mentioned above.
2. Stop if the density of the cluster that results from the best merger is
below some threshold. The density can be defined in many different ways.
Roughly, it should be the number of cluster points per unit volume of the
cluster. That ratio can be estimated by the number of points divided by
some power of the diameter or radius of the cluster. The correct power
could be the number of dimensions of the space. Sometimes, 1 or 2 is
chosen as the power, regardless of the number of dimensions.
3. Stop when there is evidence that the next pair of clusters to be combined
yields a bad cluster. For example, we could track the average diameter of
all the current clusters. As long as we are combining points that belong
in a cluster, this average will rise gradually. However, if we combine
two clusters that really don't deserve to be combined, then the average
diameter will take a sudden jump.
Search WWH ::




Custom Search