the object to its cluster center is squared, and the distances are summed. This objective
function tries to make the resulting k clusters as compact and as separate as possible.
Optimizing the within-cluster variation is computationally challenging. In the worst case, we would have to enumerate a number of possible partitionings that is exponential in the number of clusters, and check the within-cluster variation values. It has been shown that the problem is NP-hard in general Euclidean space even for two clusters (i.e., k = 2). Moreover, the problem is NP-hard for a general number of clusters k even in 2-D Euclidean space. If the number of clusters k and the dimensionality of the space d are fixed, the problem can be solved in time O(n^(dk+1) log n), where n is the number of objects. To overcome the prohibitive computational cost of the exact solution, greedy approaches are often used in practice. A prime example is the k-means algorithm, which is simple and commonly used.
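The within-cluster variation being optimized can be sketched as follows (a minimal illustration, assuming each object is a list of coordinates; the function name `within_cluster_variation` is our own, not from the text):

```python
def within_cluster_variation(D, centers, assign):
    # Sum, over all objects, of the squared Euclidean distance
    # from each object to the center of its assigned cluster.
    return sum(
        sum((xi - ci) ** 2 for xi, ci in zip(x, centers[a]))
        for x, a in zip(D, assign)
    )

# Example: two points assigned to a single center at their midpoint.
D = [[0.0, 0.0], [2.0, 0.0]]
centers = [[1.0, 0.0]]
assign = [0, 0]
print(within_cluster_variation(D, centers, assign))  # 2.0
```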
“How does the k-means algorithm work?” The k-means algorithm defines the centroid of a cluster as the mean value of the points within the cluster. It proceeds as follows. First, it randomly selects k of the objects in D, each of which initially represents a cluster mean or center. Each of the remaining objects is assigned to the cluster to which it is the most similar, based on the Euclidean distance between the object and the cluster mean. The k-means algorithm then iteratively improves the within-cluster variation. For each cluster, it computes the new mean using the objects assigned to the cluster in the previous iteration. All the objects are then reassigned using the updated means as the new cluster centers. The iterations continue until the assignment is stable, that is, the clusters formed in the current round are the same as those formed in the previous round. The k-means procedure is summarized in Figure 10.2.
Algorithm: k-means. The k-means algorithm for partitioning, where each cluster's center
is represented by the mean value of the objects in the cluster.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar,
based on the mean value of the objects in the cluster;
(4) update the cluster means, that is, calculate the mean value of the objects for
each cluster;
(5) until no change;
Figure 10.2 The k-means partitioning algorithm.
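The steps of Figure 10.2 can be sketched in Python as follows. This is a minimal illustration, not the text's own code; the helper `dist2` and the `max_iter` safeguard against slow convergence are our additions.

```python
import random

def dist2(x, y):
    # Squared Euclidean distance between two coordinate lists.
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def kmeans(D, k, max_iter=100):
    # (1) arbitrarily choose k objects from D as the initial cluster centers
    centers = random.sample(D, k)
    assign = None
    for _ in range(max_iter):               # (2) repeat
        # (3) (re)assign each object to the cluster whose mean is nearest
        new_assign = [min(range(k), key=lambda j: dist2(x, centers[j]))
                      for x in D]
        if new_assign == assign:            # (5) until no change
            break
        assign = new_assign
        # (4) update the cluster means from the current assignment
        for j in range(k):
            members = [x for x, a in zip(D, assign) if a == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers, assign

# Usage: two well-separated groups of 2-D points.
random.seed(0)
D = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
     [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]]
centers, assign = kmeans(D, 2)
```

Note that, as the text's later discussion of k-means implies, the result depends on the random choice of initial centers, so practical implementations typically run the algorithm several times and keep the best clustering.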