the object to its cluster center is squared, and the distances are summed. This objective
function tries to make the resulting k clusters as compact and as separate as possible.
Optimizing the within-cluster variation is computationally challenging. In the worst case, we would have to enumerate a number of possible partitionings that is exponential in the number of clusters, and check the within-cluster variation values. It has been shown that the problem is NP-hard in general Euclidean space even for two clusters (i.e., k = 2). Moreover, the problem is NP-hard for a general number of clusters k even in 2-D Euclidean space. If the number of clusters k and the dimensionality of the space d are fixed, the problem can be solved in time O(n^(dk+1) log n), where n is the number of objects. To overcome the prohibitive computational cost of the exact solution, greedy approaches are often used in practice. A prime example is the k-means algorithm, which is simple and commonly used.
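The within-cluster variation being optimized can be sketched as follows (a minimal illustration, assuming each object is a list of coordinates; the function name `within_cluster_variation` is our own, not from the text):

```python
def within_cluster_variation(D, centers, assign):
    # Sum, over all objects, of the squared Euclidean distance
    # from each object to the center of its assigned cluster.
    return sum(
        sum((xi - ci) ** 2 for xi, ci in zip(x, centers[a]))
        for x, a in zip(D, assign)
    )

# Example: two points assigned to a single center at their midpoint.
D = [[0.0, 0.0], [2.0, 0.0]]
centers = [[1.0, 0.0]]
assign = [0, 0]
print(within_cluster_variation(D, centers, assign))  # 2.0
```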
“How does the k-means algorithm work?” The k-means algorithm defines the centroid of a cluster as the mean value of the points within the cluster. It proceeds as follows. First, it randomly selects k of the objects in D, each of which initially represents a cluster mean or center. Each of the remaining objects is assigned to the cluster to which it is the most similar, based on the Euclidean distance between the object and the cluster mean. The k-means algorithm then iteratively improves the within-cluster variation. For each cluster, it computes the new mean using the objects assigned to the cluster in the previous iteration. All the objects are then reassigned using the updated means as the new cluster centers. The iterations continue until the assignment is stable, that is, the clusters formed in the current round are the same as those formed in the previous round. The k-means procedure is summarized in Figure 10.2.
Algorithm: k-means. The k-means algorithm for partitioning, where each cluster's center
is represented by the mean value of the objects in the cluster.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar,
based on the mean value of the objects in the cluster;
(4) update the cluster means, that is, calculate the mean value of the objects for
each cluster;
(5) until no change;
Figure 10.2 The k-means partitioning algorithm.
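The steps of Figure 10.2 can be sketched in Python as follows. This is a minimal illustration, not the text's own code; the helper `dist2` and the `max_iter` safeguard against slow convergence are our additions.

```python
import random

def dist2(x, y):
    # Squared Euclidean distance between two coordinate lists.
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def kmeans(D, k, max_iter=100):
    # (1) arbitrarily choose k objects from D as the initial cluster centers
    centers = random.sample(D, k)
    assign = None
    for _ in range(max_iter):               # (2) repeat
        # (3) (re)assign each object to the cluster whose mean is nearest
        new_assign = [min(range(k), key=lambda j: dist2(x, centers[j]))
                      for x in D]
        if new_assign == assign:            # (5) until no change
            break
        assign = new_assign
        # (4) update the cluster means from the current assignment
        for j in range(k):
            members = [x for x, a in zip(D, assign) if a == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers, assign

# Usage: two well-separated groups of 2-D points.
random.seed(0)
D = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
     [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]]
centers, assign = kmeans(D, 2)
```

Note that, as the text's later discussion of k-means implies, the result depends on the random choice of initial centers, so practical implementations typically run the algorithm several times and keep the best clustering.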