Databases Reference
In-Depth Information
their centroid (average), and at each step the clusters with the closest centroids
are merged.
Exercise 7.2.2 : How would the clustering of Example 7.2 change if we used
for the distance between two clusters:
(a) The minimum of the distances between any two points, one from each
cluster.
(b) The average of the distances between pairs of points, one from each of the
two clusters.
Exercise 7.2.3 : Repeat the clustering of Example 7.2 if we choose to merge
the two clusters whose resulting cluster has:
(a) The smallest radius.
(b) The smallest diameter.
Exercise 7.2.4 : Compute the density of each of the three clusters in Fig. 7.2,
if “density” is defined to be the number of points divided by
(a) The square of the radius.
(b) The diameter (not squared).
What are the densities, according to (a) and (b), of the clusters that result from
the merger of any two of these three clusters. Does the difference in densities
suggest the clusters should or should not be merged?
Exercise 7.2.5 : We can select clustroids for clusters, even if the space is
Euclidean. Consider the three natural clusters in Fig. 7.2, and compute the
clustroids of each, assuming the criterion for selecting the clustroid is the point
with the minimum sum of distances to the other point in the cluster.
! Exercise 7.2.6 : Consider the space of strings with edit distance as the distance
measure. Give an example of a set of strings such that if we choose the clustroid
by minimizing the sum of the distances to the other points we get one point
as the clustroid, but if we choose the clustroid by minimizing the maximum
distance to the other points, another point becomes the clustroid.
7.3
K-means Algorithms
In this section we begin the study of point-assignment algorithms. The best
known family of clustering algorithms of this type is called k-means. They
assume a Euclidean space, and they also assume the number of clusters, k,
is known in advance. It is, however, possible to deduce k by trial and error.
After an introduction to the family of k-means algorithms, we shall focus on a
particular algorithm, called BFR after its authors, that enables us to execute
k-means on data that is too large to fit in main memory.
Search WWH ::




Custom Search