Database Reference
In-Depth Information
the clustroid by minimizing the maximum distance to the other points, another point be-
comes the clustroid.
7.3 K-means Algorithms
In this section we begin the study of point-assignment algorithms. The best known family
of clustering algorithms of this type is called k -means. They assume a Euclidean space, and
they also assume the number of clusters, k , is known in advance. It is, however, possible to
deduce k by trial and error. After an introduction to the family of k -means algorithms, we
shall focus on a particular algorithm, called BFR after its authors, that enables us to execute
k -means on data that is too large to fit in main memory.
7.3.1
K-Means Basics
A k -means algorithm is outlined in Fig. 7.7 . There are several ways to select the initial
k points that represent the clusters, and we shall discuss them in Section 7.3.2 . The heart
of the algorithm is the for-loop, in which we consider each point other than the k selec-
ted points and assign it to the closest cluster, where “closest” means closest to the centroid
of the cluster. Note that the centroid of a cluster can migrate as points are assigned to it.
However, since only points near the cluster are likely to be assigned, the centroid tends not
to move too much.
Figure 7.7 Outline of k -means algorithms
An optional step at the end is to fix the centroids of the clusters and to reassign each
point, including the k initial points, to the k clusters. Usually, a point p will be assigned to
the same cluster in which it was placed on the first pass. However, there are cases where
the centroid of p 's original cluster moved quite far from p after p was placed there, and p is
assigned to a different cluster on the second pass. In fact, even some of the original k points
could wind up being reassigned. As these examples are unusual, we shall not dwell on the
subject.
Search WWH ::




Custom Search