$\mu_i$ and $\mu_j$ are significantly different when $i \neq j$, $I$ is an identity matrix of dimension $p$, and $K$ is the number of underlying probability distributions generating the dataset.
When clusters are not hyper-spheres but hyper-ellipsoids, that is, when the variance-covariance matrix cannot be expressed as $\Sigma_i = \sigma^2 I$, Euclidean distance performs poorly as the dissimilarity measure. Figure 5.1 shows two points clustered incorrectly because Euclidean distance is used as the dissimilarity measure. In Fig. 5.1, each point is assigned to the subset whose center has the minimal Euclidean distance to that point (how to find the centers is described with the $K$-means clustering algorithm in Sec. 5.3). Fig. 5.1 contains two ellipsoidal clusters, whose centers are denoted $C_1$ and $C_2$. Consider two points $P_1$ and $P_2$. If we take into account the ellipsoidal nature of the clusters, $P_1$ should be assigned to the subset centered at $C_1$, and $P_2$ should be assigned to the subset centered at $C_2$. However, since $P_1$ has a smaller Euclidean distance to $C_2$ than to $C_1$, taking Euclidean distance as the dissimilarity measure clusters $P_1$ incorrectly into the subset centered at $C_2$. Similarly, $P_2$ is clustered incorrectly into the subset centered at $C_1$.
Fig. 5.1. Poor performance of Euclidean distance when clusters are ellipsoids.
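The misassignment described for Fig. 5.1 can be reproduced with a small numerical sketch. The centers, points, and shared covariance matrix below are hypothetical values chosen for illustration; they are not read off the figure. As one possible way of "taking into account the ellipsoid nature of clusters", the sketch uses the Mahalanobis distance, which weights each difference vector by the inverse covariance matrix. This choice is an assumption made for the example, not necessarily the measure this text goes on to use.

```python
import numpy as np

# Hypothetical values for illustration only (not taken from Fig. 5.1).
centers = np.array([[0.0, 0.0],    # C1
                    [6.0, 2.0]])   # C2

# Assumed shared covariance: both clusters are stretched along the x-axis,
# so Sigma cannot be written as sigma^2 * I.
cov = np.array([[16.0, 0.0],
                [0.0, 0.25]])

points = np.array([[4.0, 0.0],     # P1: on the long axis of the C1 cluster
                   [2.0, 2.0]])    # P2: on the long axis of the C2 cluster


def nearest_center_euclidean(points, centers):
    """Index of the center with minimal Euclidean distance to each point."""
    diffs = points[:, None, :] - centers[None, :, :]   # (n_points, n_centers, p)
    sq_dist = np.sum(diffs ** 2, axis=2)
    return np.argmin(sq_dist, axis=1)


def nearest_center_mahalanobis(points, centers, cov):
    """Index of the center with minimal Mahalanobis distance to each point."""
    diffs = points[:, None, :] - centers[None, :, :]
    cov_inv = np.linalg.inv(cov)
    # Squared distance d^T * cov_inv * d for every (point, center) pair.
    sq_dist = np.einsum("nkp,pq,nkq->nk", diffs, cov_inv, diffs)
    return np.argmin(sq_dist, axis=1)


euclid_labels = nearest_center_euclidean(points, centers)
mahal_labels = nearest_center_mahalanobis(points, centers, cov)

print("Euclidean assignment (0 = C1, 1 = C2):", euclid_labels)
print("Mahalanobis assignment (0 = C1, 1 = C2):", mahal_labels)
```

With these numbers, the Euclidean rule sends P1 to C2 and P2 to C1, because each point is about 2.83 units from the wrong center and 4 units from the right one. The covariance-aware rule reverses both assignments, since the short distance to the wrong center lies mostly along the narrow axis of the ellipsoid.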