Clustering - Mining of Massive Datasets

Databases Reference

In-Depth Information

their centroid (average), and at each step the clusters with the closest centroids

are merged.

Exercise 7.2.2 : How would the clustering of Example 7.2 change if we used

for the distance between two clusters:

(a) The minimum of the distances between any two points, one from each

cluster.

(b) The average of the distances between pairs of points, one from each of the

two clusters.

Exercise 7.2.3 : Repeat the clustering of Example 7.2 if we choose to merge

the two clusters whose resulting cluster has:

(a) The smallest radius.

(b) The smallest diameter.

Exercise 7.2.4 : Compute the density of each of the three clusters in Fig. 7.2,

if “density” is defined to be the number of points divided by

(a) The square of the radius.

(b) The diameter (not squared).

What are the densities, according to (a) and (b), of the clusters that result from

the merger of any two of these three clusters. Does the difference in densities

suggest the clusters should or should not be merged?

Exercise 7.2.5 : We can select clustroids for clusters, even if the space is

Euclidean. Consider the three natural clusters in Fig. 7.2, and compute the

clustroids of each, assuming the criterion for selecting the clustroid is the point

with the minimum sum of distances to the other point in the cluster.

! Exercise 7.2.6 : Consider the space of strings with edit distance as the distance

measure. Give an example of a set of strings such that if we choose the clustroid

by minimizing the sum of the distances to the other points we get one point

as the clustroid, but if we choose the clustroid by minimizing the maximum

distance to the other points, another point becomes the clustroid.

7.3

K-means Algorithms

In this section we begin the study of point-assignment algorithms. The best

known family of clustering algorithms of this type is called k-means. They

assume a Euclidean space, and they also assume the number of clusters, k,

is known in advance. It is, however, possible to deduce k by trial and error.

After an introduction to the family of k-means algorithms, we shall focus on a

particular algorithm, called BFR after its authors, that enables us to execute

k-means on data that is too large to fit in main memory.

Search WWH ::

Custom Search

Home