Figure 10.4 Four cases of the cost function for k-medoids clustering: (a) reassigned to o_i; (b) reassigned to o_random; (c) no change; (d) reassigned to o_random. (Figure legend: data object, cluster center, before swapping, after swapping.)
Object o remains assigned to the cluster represented by o_i as long as o is still closer to o_i
than to o_random (Figure 10.4c). Otherwise, o is reassigned to o_random (Figure 10.4d).
Each time a reassignment occurs, a difference in absolute error, E, is contributed to
the cost function. Therefore, the cost function calculates the difference in absolute-error
value if a current representative object is replaced by a nonrepresentative object. The
total cost of swapping is the sum of the costs incurred by all nonrepresentative objects. If
the total cost is negative, then o_j is replaced or swapped with o_random because the actual
absolute error E is reduced. If the total cost is positive, the current representative object,
o_j, is considered acceptable, and nothing is changed in the iteration.
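As a minimal sketch of this swap test (not the book's pseudocode): the function below, with the hypothetical name total_swap_cost, computes the change in E when a current medoid is replaced by a candidate, assuming the objects are numeric points compared by Euclidean distance.

import numpy as np

def total_swap_cost(X, medoids, j, o_random):
    """Change in absolute error E if medoids[j] is replaced by o_random.

    X        : (n, d) array of data objects
    medoids  : list of indices of the current representative objects
    j        : position in `medoids` of the representative to replace
    o_random : index of the candidate nonrepresentative object
    """
    trial = list(medoids)
    trial[j] = o_random  # trial configuration: swap o_j with o_random

    def absolute_error(meds):
        # Each object contributes its distance to its closest medoid.
        dists = np.linalg.norm(X[:, None, :] - X[meds], axis=2)  # shape (n, k)
        return dists.min(axis=1).sum()

    return absolute_error(trial) - absolute_error(medoids)

A negative return value corresponds to the "total cost is negative" case above: the swap reduces E, so o_j would be replaced by o_random.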
“Which method is more robust, k-means or k-medoids?” The k-medoids method is
more robust than k-means in the presence of noise and outliers because a medoid is less
influenced by outliers or other extreme values than a mean. However, the complexity
of each iteration in the k-medoids algorithm is O(k(n − k)²). For large values of n
and k, such computation becomes very costly, and much more costly than the k-means
method. Both methods require the user to specify k, the number of clusters.
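A quick numeric illustration of this robustness (a toy example, not from the book): with a single extreme value, the mean is dragged toward the outlier, while the medoid stays with the bulk of the data.

import numpy as np

points = np.array([1, 2, 3, 4, 100])   # 100 is an outlier
print(points.mean())                    # 22.0 -- pulled toward the outlier

# The medoid is the data point minimizing total distance to all others.
totals = np.abs(points[:, None] - points[None, :]).sum(axis=1)
print(points[totals.argmin()])          # 3 -- unaffected by the outlier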
“How can we scale up the k-medoids method?” A typical k-medoids partitioning algorithm
like PAM (Figure 10.5) works effectively for small data sets, but does not scale well
for large data sets. To deal with larger data sets, a sampling-based method called CLARA
(Clustering LARge Applications) can be used. Instead of taking the whole data set into
consideration, CLARA uses a random sample of the data set. The PAM algorithm is then
applied to compute the best medoids from the sample. Ideally, the sample should closely
represent the original data set. In many cases, a large sample works well if it is created so
that each object has equal probability of being selected into the sample. The representative
objects (medoids) chosen will likely be similar to those that would have been chosen
from the whole data set. CLARA builds clusterings from multiple random samples and
returns the best clustering as the output. The complexity of computing the medoids on
a random sample is O(ks² + k(n − k)), where s is the size of the sample, k is the number
of clusters, and n is the total number of objects. CLARA can deal with larger data sets
than PAM.
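The overall sampling loop is straightforward to sketch. Below is a rough Python illustration, not CLARA's original implementation: pam here is a deliberately naive swap-based routine in the spirit of Figure 10.5, and the 40 + 2k default sample size is the heuristic commonly cited for CLARA.

import numpy as np

def pam(X, k, rng=None):
    """Naive PAM sketch: from random initial medoids, accept any
    representative/nonrepresentative swap that lowers the absolute
    error E, until no swap helps. Returns medoid indices into X."""
    rng = np.random.default_rng(rng)
    meds = list(rng.choice(len(X), size=k, replace=False))

    def error(m):
        return np.linalg.norm(X[:, None, :] - X[m], axis=2).min(axis=1).sum()

    improved = True
    while improved:
        improved = False
        for j in range(k):
            for o in range(len(X)):
                trial = list(meds)
                trial[j] = o
                if o not in meds and error(trial) < error(meds):
                    meds, improved = trial, True
    return np.array(meds)

def clara(X, k, num_samples=5, sample_size=None, rng=None):
    """CLARA sketch: run PAM on several random samples, score each
    candidate medoid set against the FULL data set, keep the best."""
    rng = np.random.default_rng(rng)
    n = len(X)
    sample_size = sample_size or min(n, 40 + 2 * k)  # commonly cited heuristic
    best_err, best_meds = np.inf, None
    for _ in range(num_samples):
        sample = rng.choice(n, size=sample_size, replace=False)
        meds = sample[pam(X[sample], k)]             # map back to full indices
        err = np.linalg.norm(X[:, None, :] - X[meds], axis=2).min(axis=1).sum()
        if err < best_err:
            best_err, best_meds = err, meds
    return best_meds, best_err

Note that each candidate medoid set is scored against all n objects, not just the sample; this is what lets CLARA return the best clustering across samples.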
The effectiveness of CLARA depends on the sample size. Notice that PAM searches
for the best k-medoids among a given data set, whereas CLARA searches for the best
k-medoids among the selected sample of the data set. CLARA cannot find a good
clustering if any of the best sampled medoids is far from the best k-medoids. If an object
is one of the best k-medoids but is not selected during sampling, CLARA will never find
the best clustering.