given that 3.5 is the mean of cluster {1, 2, 3, 8} and 14.67 is the mean of cluster {9, 10, 25}.
The latter partitioning has the lowest within-cluster variation; therefore, the k -means
method assigns the value 8 to a cluster different from that containing 9 and 10 due to
the outlier point 25. Moreover, the center of the second cluster, 14.67, is substantially far
from all the members in the cluster.
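The comparison of within-cluster variation can be checked with a short computation. This is an illustrative sketch: the alternative partitioning {1, 2, 3} / {8, 9, 10, 25} is the one implied by the discussion of where the value 8 is assigned.

```python
def within_cluster_variation(clusters):
    """Sum of squared distances from each point to its cluster mean."""
    total = 0.0
    for c in clusters:
        mean = sum(c) / len(c)
        total += sum((p - mean) ** 2 for p in c)
    return total

# Two candidate 2-partitionings of the points {1, 2, 3, 8, 9, 10, 25}:
a = [[1, 2, 3], [8, 9, 10, 25]]   # 8 grouped with 9, 10, and the outlier 25
b = [[1, 2, 3, 8], [9, 10, 25]]   # 8 grouped with 1, 2, 3

print(within_cluster_variation(a))  # 196.0
print(within_cluster_variation(b))  # ~189.67 -- lower, so k-means prefers it
```

The outlier 25 drags the mean of its cluster (13 in partitioning `a`, 14.67 in `b`) away from the other members, which is exactly the sensitivity discussed above.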
“How can we modify the k-means algorithm to diminish such sensitivity to outliers?”
Instead of taking the mean value of the objects in a cluster as a reference point, we can
pick actual objects to represent the clusters, using one representative object per cluster.
Each remaining object is assigned to the cluster whose representative object is the most similar. The partitioning method is then performed based on the principle of
minimizing the sum of the dissimilarities between each object p and its corresponding
representative object. That is, an absolute-error criterion is used, defined as
$$E = \sum_{i=1}^{k} \sum_{p \in C_i} \mathrm{dist}(p, o_i), \tag{10.2}$$
where E is the sum of the absolute error for all objects p in the data set, and o_i is the representative object of C_i. This is the basis for the k-medoids method, which groups n objects into k clusters by minimizing the absolute error (Eq. 10.2).
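The absolute-error criterion of Eq. (10.2) is easy to evaluate directly. The sketch below uses the 1-D example points from above, with dist taken as absolute difference; the medoid choices o_1 = 2 and o_2 = 9 are hypothetical, picked only for illustration.

```python
def absolute_error(points, medoids):
    """E = sum over all points of the distance to the nearest medoid
    (Eq. 10.2, specialized to 1-D points with dist = absolute difference)."""
    return sum(min(abs(p - o) for o in medoids) for p in points)

points = [1, 2, 3, 8, 9, 10, 25]
# With hypothetical medoids o_1 = 2 and o_2 = 9, each point contributes
# its absolute difference to the closer of the two:
print(absolute_error(points, [2, 9]))  # → 20
```

Note that the medoids are actual data objects, so even the outlier 25 contributes only its distance to a real point (here |25 − 9| = 16) rather than pulling a mean toward itself.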
When k = 1, we can find the exact median in O(n^2) time. However, when k is a general positive number, the k-medoids problem is NP-hard.
The Partitioning Around Medoids (PAM) algorithm (see Figure 10.5 later) is a pop-
ular realization of k -medoids clustering. It tackles the problem in an iterative, greedy
way. As in the k-means algorithm, the initial representative objects (called seeds) are chosen arbitrarily. We then consider whether replacing a representative object by a nonrepresentative object would improve the clustering quality. All the possible replacements
are tried out. The iterative process of replacing representative objects by other objects
continues until the quality of the resulting clustering cannot be improved by any replace-
ment. This quality is measured by a cost function of the average dissimilarity between
an object and the representative object of its cluster.
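The iterative, greedy swap procedure described above can be sketched as follows. This is a simplified illustration, not the pseudocode of Figure 10.5; it uses the total absolute error of Eq. (10.2) as the quality measure and accepts any swap that lowers it.

```python
import random

def pam(points, k, dist, max_iter=100, seed=0):
    """Greedy PAM-style sketch: start from arbitrary medoids, then repeatedly
    try every (medoid, non-medoid) swap, keeping a swap only if it lowers the
    total absolute error; stop when no swap improves the clustering."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)       # arbitrary initial seeds

    def cost(meds):
        # Eq. (10.2): each object contributes its distance to the closest medoid.
        return sum(min(dist(p, o) for o in meds) for p in points)

    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for j in range(k):
            for o_random in points:
                if o_random in medoids:
                    continue
                candidate = medoids[:j] + [o_random] + medoids[j + 1:]
                c = cost(candidate)
                if c < best:              # keep the swap only if it helps
                    medoids, best = candidate, c
                    improved = True
        if not improved:                  # no replacement improves the quality
            break
    return medoids, best

meds, err = pam([1, 2, 3, 8, 9, 10, 25], k=2, dist=lambda a, b: abs(a - b))
print(sorted(meds), err)
```

Because the search is greedy, the result depends on the initial seeds: the procedure stops at a clustering that no single replacement can improve, which need not be the global minimum.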
Specifically, let o_1, ..., o_k be the current set of representative objects (i.e., medoids). To determine whether a nonrepresentative object, denoted by o_random, is a good replacement for a current medoid o_j (1 ≤ j ≤ k), we calculate the distance from every object p to the closest object in the set {o_1, ..., o_{j-1}, o_random, o_{j+1}, ..., o_k}, and use the distance to update the cost function. The reassignments of objects to {o_1, ..., o_{j-1}, o_random, o_{j+1}, ..., o_k} are simple. Suppose object p is currently assigned to
a cluster represented by medoid o_j (Figure 10.4a or b). Do we need to reassign p to a different cluster if o_j is being replaced by o_random? Object p needs to be reassigned to either o_random or some other cluster represented by o_i (i ≠ j), whichever is the closest. For example, in Figure 10.4(a), p is closest to o_i and therefore is reassigned to o_i. In Figure 10.4(b), however, p is closest to o_random and so is reassigned to o_random. What if, instead, p is currently assigned to a cluster represented by some other object o_i, i ≠ j?
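Evaluating one such candidate replacement can be illustrated with a small helper. This is a sketch; the 1-D points and the current medoids {2, 25} are hypothetical values chosen to show an improving swap.

```python
def swap_cost(points, medoids, j, o_random, dist):
    """Cost of the clustering after replacing medoid o_j with o_random:
    every point p is (re)assigned to its closest object in
    {o_1, ..., o_{j-1}, o_random, o_{j+1}, ..., o_k}."""
    candidate = medoids[:j] + [o_random] + medoids[j + 1:]
    return sum(min(dist(p, o) for o in candidate) for p in points)

points = [1, 2, 3, 8, 9, 10, 25]
d = lambda a, b: abs(a - b)

# Current (hypothetical) medoids {2, 25}; try replacing o_2 = 25 with o_random = 9:
before = sum(min(d(p, o) for o in [2, 25]) for p in points)
after = swap_cost(points, [2, 25], 1, 9, d)
print(before, after)  # → 23 20 -- the swap lowers the cost, so it would be kept
```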