Figure 10.4 Four cases of the cost function for k-medoids clustering: (a) reassigned to o_i; (b) reassigned to o_random; (c) no change; (d) reassigned to o_random. (Figure legend: data object, cluster center, before swapping, after swapping.)
Object o remains assigned to the cluster represented by o_i as long as o is still closer to o_i
than to o_random (Figure 10.4c). Otherwise, o is reassigned to o_random (Figure 10.4d).
Each time a reassignment occurs, a difference in absolute error, E, is contributed to
the cost function. Therefore, the cost function calculates the difference in absolute-error
value if a current representative object is replaced by a nonrepresentative object. The
total cost of swapping is the sum of the costs incurred by all nonrepresentative objects. If
the total cost is negative, then o_j is replaced or swapped with o_random because the actual
absolute error E is reduced. If the total cost is positive, the current representative object,
o_j, is considered acceptable, and nothing is changed in the iteration.
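As a minimal sketch of this swap test (not the book's pseudocode): the function below, with the hypothetical name total_swap_cost, computes the change in E when a current medoid is replaced by a candidate, assuming the objects are numeric points compared by Euclidean distance.

import numpy as np

def total_swap_cost(X, medoids, j, o_random):
    """Change in absolute error E if medoids[j] is replaced by o_random.

    X        : (n, d) array of data objects
    medoids  : list of indices of the current representative objects
    j        : position in `medoids` of the representative to replace
    o_random : index of the candidate nonrepresentative object
    """
    trial = list(medoids)
    trial[j] = o_random  # trial configuration: swap o_j with o_random

    def absolute_error(meds):
        # Each object contributes its distance to its closest medoid.
        dists = np.linalg.norm(X[:, None, :] - X[meds], axis=2)  # shape (n, k)
        return dists.min(axis=1).sum()

    return absolute_error(trial) - absolute_error(medoids)

A negative return value corresponds to the "total cost is negative" case above: the swap reduces E, so o_j would be replaced by o_random.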
“Which method is more robust, k-means or k-medoids?” The k-medoids method is
more robust than k-means in the presence of noise and outliers because a medoid is less
influenced by outliers or other extreme values than a mean. However, the complexity
of each iteration in the k-medoids algorithm is O(k(n − k)²). For large values of n
and k, such computation becomes very costly, and much more costly than the k-means
method. Both methods require the user to specify k, the number of clusters.
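A quick numeric illustration of this robustness (a toy example, not from the book): with a single extreme value, the mean is dragged toward the outlier, while the medoid stays with the bulk of the data.

import numpy as np

points = np.array([1, 2, 3, 4, 100])   # 100 is an outlier
print(points.mean())                    # 22.0 -- pulled toward the outlier

# The medoid is the data point minimizing total distance to all others.
totals = np.abs(points[:, None] - points[None, :]).sum(axis=1)
print(points[totals.argmin()])          # 3 -- unaffected by the outlier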
“How can we scale up the k-medoids method?” A typical k-medoids partitioning algorithm
like PAM (Figure 10.5) works effectively for small data sets, but does not scale well
for large data sets. To deal with larger data sets, a sampling-based method called CLARA
(Clustering LARge Applications) can be used. Instead of taking the whole data set into
consideration, CLARA uses a random sample of the data set. The PAM algorithm is then
applied to compute the best medoids from the sample. Ideally, the sample should closely
represent the original data set. In many cases, a large sample works well if it is created so
that each object has equal probability of being selected into the sample. The representative
objects (medoids) chosen will likely be similar to those that would have been chosen
from the whole data set. CLARA builds clusterings from multiple random samples and
returns the best clustering as the output. The complexity of computing the medoids on
a random sample is O(ks² + k(n − k)), where s is the size of the sample, k is the number
of clusters, and n is the total number of objects. CLARA can deal with larger data sets
than PAM.
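The overall sampling loop is straightforward to sketch. Below is a rough Python illustration, not CLARA's original implementation: pam here is a deliberately naive swap-based routine in the spirit of Figure 10.5, and the 40 + 2k default sample size is the heuristic commonly cited for CLARA.

import numpy as np

def pam(X, k, rng=None):
    """Naive PAM sketch: from random initial medoids, accept any
    representative/nonrepresentative swap that lowers the absolute
    error E, until no swap helps. Returns medoid indices into X."""
    rng = np.random.default_rng(rng)
    meds = list(rng.choice(len(X), size=k, replace=False))

    def error(m):
        return np.linalg.norm(X[:, None, :] - X[m], axis=2).min(axis=1).sum()

    improved = True
    while improved:
        improved = False
        for j in range(k):
            for o in range(len(X)):
                trial = list(meds)
                trial[j] = o
                if o not in meds and error(trial) < error(meds):
                    meds, improved = trial, True
    return np.array(meds)

def clara(X, k, num_samples=5, sample_size=None, rng=None):
    """CLARA sketch: run PAM on several random samples, score each
    candidate medoid set against the FULL data set, keep the best."""
    rng = np.random.default_rng(rng)
    n = len(X)
    sample_size = sample_size or min(n, 40 + 2 * k)  # commonly cited heuristic
    best_err, best_meds = np.inf, None
    for _ in range(num_samples):
        sample = rng.choice(n, size=sample_size, replace=False)
        meds = sample[pam(X[sample], k)]             # map back to full indices
        err = np.linalg.norm(X[:, None, :] - X[meds], axis=2).min(axis=1).sum()
        if err < best_err:
            best_err, best_meds = err, meds
    return best_meds, best_err

Note that each candidate medoid set is scored against all n objects, not just the sample; this is what lets CLARA return the best clustering across samples.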
The effectiveness of CLARA depends on the sample size. Notice that PAM searches
for the best k-medoids among a given data set, whereas CLARA searches for the best
k-medoids among the selected sample of the data set. CLARA cannot find a good
clustering if any of the best sampled medoids is far from the best k-medoids. If an object
is one of the best k-medoids but is not selected during sampling, CLARA will never find
the best clustering.