Clustering - Mining of Massive Datasets

Database Reference

In-Depth Information

the clustroid by minimizing the maximum distance to the other points, another point be-

comes the clustroid.

7.3 K-means Algorithms

In this section we begin the study of point-assignment algorithms. The best known family

of clustering algorithms of this type is called k -means. They assume a Euclidean space, and

they also assume the number of clusters, k , is known in advance. It is, however, possible to

deduce k by trial and error. After an introduction to the family of k -means algorithms, we

shall focus on a particular algorithm, called BFR after its authors, that enables us to execute

k -means on data that is too large to fit in main memory.

7.3.1

K-Means Basics

A k -means algorithm is outlined in Fig. 7.7 . There are several ways to select the initial

k points that represent the clusters, and we shall discuss them in Section 7.3.2 . The heart

of the algorithm is the for-loop, in which we consider each point other than the k selec-

ted points and assign it to the closest cluster, where “closest” means closest to the centroid

of the cluster. Note that the centroid of a cluster can migrate as points are assigned to it.

However, since only points near the cluster are likely to be assigned, the centroid tends not

to move too much.

Figure 7.7 Outline of k -means algorithms

An optional step at the end is to fix the centroids of the clusters and to reassign each

point, including the k initial points, to the k clusters. Usually, a point p will be assigned to

the same cluster in which it was placed on the first pass. However, there are cases where

the centroid of p 's original cluster moved quite far from p after p was placed there, and p is

assigned to a different cluster on the second pass. In fact, even some of the original k points

could wind up being reassigned. As these examples are unusual, we shall not dwell on the

subject.

Search WWH ::

Custom Search

Home