given that 3.5 is the mean of cluster {1, 2, 3, 8} and 14.67 is the mean of cluster {9, 10, 25}.
The latter partitioning has the lowest within-cluster variation; therefore, the k -means
method assigns the value 8 to a cluster different from that containing 9 and 10 due to
the outlier point 25. Moreover, the center of the second cluster, 14.67, is substantially far
from all the members in the cluster.
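The comparison of within-cluster variation can be checked with a short computation. This is an illustrative sketch: the alternative partitioning {1, 2, 3} / {8, 9, 10, 25} is the one implied by the discussion of where the value 8 is assigned.

```python
def within_cluster_variation(clusters):
    """Sum of squared distances from each point to its cluster mean."""
    total = 0.0
    for c in clusters:
        mean = sum(c) / len(c)
        total += sum((p - mean) ** 2 for p in c)
    return total

# Two candidate 2-partitionings of the points {1, 2, 3, 8, 9, 10, 25}:
a = [[1, 2, 3], [8, 9, 10, 25]]   # 8 grouped with 9, 10, and the outlier 25
b = [[1, 2, 3, 8], [9, 10, 25]]   # 8 grouped with 1, 2, 3

print(within_cluster_variation(a))  # 196.0
print(within_cluster_variation(b))  # ~189.67 -- lower, so k-means prefers it
```

The outlier 25 drags the mean of its cluster (13 in partitioning `a`, 14.67 in `b`) away from the other members, which is exactly the sensitivity discussed above.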
“How can we modify the k-means algorithm to diminish such sensitivity to outliers?”
Instead of taking the mean value of the objects in a cluster as a reference point, we can
pick actual objects to represent the clusters, using one representative object per cluster.
Each remaining object is assigned to the cluster whose representative object is the most similar. The partitioning method is then performed based on the principle of
minimizing the sum of the dissimilarities between each object p and its corresponding
representative object. That is, an absolute-error criterion is used, defined as
$$E = \sum_{i=1}^{k} \sum_{p \in C_i} \mathrm{dist}(p, o_i), \tag{10.2}$$
where E is the sum of the absolute error for all objects p in the data set, and o_i is the representative object of C_i. This is the basis for the k-medoids method, which groups n objects into k clusters by minimizing the absolute error (Eq. 10.2).
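The absolute-error criterion of Eq. (10.2) is easy to evaluate directly. The sketch below uses the 1-D example points from above, with dist taken as absolute difference; the medoid choices o_1 = 2 and o_2 = 9 are hypothetical, picked only for illustration.

```python
def absolute_error(points, medoids):
    """E = sum over all points of the distance to the nearest medoid
    (Eq. 10.2, specialized to 1-D points with dist = absolute difference)."""
    return sum(min(abs(p - o) for o in medoids) for p in points)

points = [1, 2, 3, 8, 9, 10, 25]
# With hypothetical medoids o_1 = 2 and o_2 = 9, each point contributes
# its absolute difference to the closer of the two:
print(absolute_error(points, [2, 9]))  # → 20
```

Note that the medoids are actual data objects, so even the outlier 25 contributes only its distance to a real point (here |25 − 9| = 16) rather than pulling a mean toward itself.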
When k = 1, we can find the exact median in O(n^2) time. However, when k is a general positive number, the k-medoids problem is NP-hard.
The Partitioning Around Medoids (PAM) algorithm (see Figure 10.5 later) is a pop-
ular realization of k -medoids clustering. It tackles the problem in an iterative, greedy
way. As in the k-means algorithm, the initial representative objects (called seeds) are chosen arbitrarily. We then consider whether replacing a representative object by a nonrepresentative object would improve the clustering quality. All the possible replacements
are tried out. The iterative process of replacing representative objects by other objects
continues until the quality of the resulting clustering cannot be improved by any replace-
ment. This quality is measured by a cost function of the average dissimilarity between
an object and the representative object of its cluster.
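The iterative, greedy swap procedure described above can be sketched as follows. This is a simplified illustration, not the pseudocode of Figure 10.5; it uses the total absolute error of Eq. (10.2) as the quality measure and accepts any swap that lowers it.

```python
import random

def pam(points, k, dist, max_iter=100, seed=0):
    """Greedy PAM-style sketch: start from arbitrary medoids, then repeatedly
    try every (medoid, non-medoid) swap, keeping a swap only if it lowers the
    total absolute error; stop when no swap improves the clustering."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)       # arbitrary initial seeds

    def cost(meds):
        # Eq. (10.2): each object contributes its distance to the closest medoid.
        return sum(min(dist(p, o) for o in meds) for p in points)

    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for j in range(k):
            for o_random in points:
                if o_random in medoids:
                    continue
                candidate = medoids[:j] + [o_random] + medoids[j + 1:]
                c = cost(candidate)
                if c < best:              # keep the swap only if it helps
                    medoids, best = candidate, c
                    improved = True
        if not improved:                  # no replacement improves the quality
            break
    return medoids, best

meds, err = pam([1, 2, 3, 8, 9, 10, 25], k=2, dist=lambda a, b: abs(a - b))
print(sorted(meds), err)
```

Because the search is greedy, the result depends on the initial seeds: the procedure stops at a clustering that no single replacement can improve, which need not be the global minimum.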
Specifically, let o_1, ..., o_k be the current set of representative objects (i.e., medoids). To determine whether a nonrepresentative object, denoted by o_random, is a good replacement for a current medoid o_j (1 ≤ j ≤ k), we calculate the distance from every object p to the closest object in the set {o_1, ..., o_{j-1}, o_random, o_{j+1}, ..., o_k}, and use the distance to update the cost function. The reassignments of objects to {o_1, ..., o_{j-1}, o_random, o_{j+1}, ..., o_k} are simple. Suppose object p is currently assigned to
a cluster represented by medoid o_j (Figure 10.4a or b). Do we need to reassign p to a different cluster if o_j is being replaced by o_random? Object p needs to be reassigned to either o_random or some other cluster represented by o_i (i ≠ j), whichever is the closest. For example, in Figure 10.4(a), p is closest to o_i and therefore is reassigned to o_i. In Figure 10.4(b), however, p is closest to o_random and so is reassigned to o_random. What if, instead, p is currently assigned to a cluster represented by some other object o_i, i ≠ j?
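Evaluating one such candidate replacement can be illustrated with a small helper. This is a sketch; the 1-D points and the current medoids {2, 25} are hypothetical values chosen to show an improving swap.

```python
def swap_cost(points, medoids, j, o_random, dist):
    """Cost of the clustering after replacing medoid o_j with o_random:
    every point p is (re)assigned to its closest object in
    {o_1, ..., o_{j-1}, o_random, o_{j+1}, ..., o_k}."""
    candidate = medoids[:j] + [o_random] + medoids[j + 1:]
    return sum(min(dist(p, o) for o in candidate) for p in points)

points = [1, 2, 3, 8, 9, 10, 25]
d = lambda a, b: abs(a - b)

# Current (hypothetical) medoids {2, 25}; try replacing o_2 = 25 with o_random = 9:
before = sum(min(d(p, o) for o in [2, 25]) for p in points)
after = swap_cost(points, [2, 25], 1, 9, d)
print(before, after)  # → 23 20 -- the swap lowers the cost, so it would be kept
```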