Databases Reference
In-Depth Information
+
+
+
+
+
+
+
+
+
(a) Initial clustering
(b) Iterate
(c) Final clustering
Figure10.3 Clustering of a set of objects using the k -means method; for (b) update cluster centers and
reassign objects accordingly (the mean of each cluster is marked by a C).
Example10.1 Clustering by k -means partitioning. Consider a set of objects located in 2-D space,
as depicted in Figure 10.3(a). Let k D 3, that is, the user would like the objects to be
partitioned into three clusters.
According to the algorithm in Figure 10.2, we arbitrarily choose three objects as
the three initial cluster centers, where cluster centers are marked by a C. Each object
is assigned to a cluster based on the cluster center to which it is the nearest. Such a
distribution forms silhouettes encircled by dotted curves, as shown in Figure 10.3(a).
Next, the cluster centers are updated. That is, the mean value of each cluster is recal-
culated based on the current objects in the cluster. Using the new cluster centers, the
objects are redistributed to the clusters based on which cluster center is the nearest.
Such a redistribution forms new silhouettes encircled by dashed curves, as shown in
Figure 10.3(b).
This process iterates, leading to Figure 10.3(c). The process of iteratively reassigning
objects to clusters to improve the partitioning is referred to as iterative relocation . Even-
tually, no reassignment of the objects in any cluster occurs and so the process terminates.
The resulting clusters are returned by the clustering process.
The k -means method is not guaranteed to converge to the global optimum and often
terminates at a local optimum. The results may depend on the initial random selection
of cluster centers. (You will be asked to give an example to show this as an exercise.)
To obtain good results in practice, it is common to run the k -means algorithm multiple
times with different initial cluster centers.
The time complexity of the k -means algorithm is O
, where n is the total number
of objects, k is the number of clusters, and t is the number of iterations. Normally, k n
and t n . Therefore, the method is relatively scalable and efficient in processing large
data sets.
There are several variants of the k -means method. These can differ in the selection
of the initial k -means, the calculation of dissimilarity, and the strategies for calculating
cluster means.
.
nkt
/
 
Search WWH ::




Custom Search