$\mu_i$ and $\mu_j$ are significantly different when $i \neq j$, $I$ is an identity matrix of dimension $p$, and $K$ is the number of underlying probability distributions generating the dataset.
When clusters are not hyper-spheres but hyper-ellipsoids, that is, when the
variance-covariance matrix cannot be expressed as $\Sigma_i = \sigma^2 I$, Euclidean
distance performs poorly as the dissimilarity measure. Figure 5.1 shows two
points clustered incorrectly because Euclidean distance is used as the
dissimilarity measure. In Fig. 5.1, each point is assigned to the subset whose
center has the minimal Euclidean distance to the point (how to find the centers
is described with the K-means clustering algorithm in Sec. 5.3). Fig. 5.1
contains two ellipsoidal clusters, whose centers are denoted $C_1$ and $C_2$.
Consider two points $P_1$ and $P_2$. If we take into account the ellipsoidal
shape of the clusters, $P_1$ should be assigned to the subset centered at $C_1$,
and $P_2$ should be assigned to the subset centered at $C_2$. However, because
$P_1$ is closer to $C_2$ in Euclidean distance, taking Euclidean distance as the
dissimilarity measure incorrectly assigns $P_1$ to the subset centered at $C_2$.
Similarly, $P_2$ is incorrectly assigned to the subset centered at $C_1$.
Fig. 5.1. Poor performance of Euclidean distance when clusters are ellipsoids.
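
The situation in Fig. 5.1 can be reproduced numerically with a small sketch. The coordinates of the centers and points, the shared covariance matrix, and the helper names below are illustrative assumptions, not values taken from the figure. For contrast, the sketch also assigns the points with a covariance-aware measure, the Mahalanobis distance; this measure is introduced here only as one possible way to "take into account the ellipsoid nature of clusters" and is not named in the passage above.

```python
import math

# Hypothetical setup: two clusters elongated along the x-axis.
# Both share the covariance [[9, 0], [0, 0.25]] (an assumption for illustration).
CENTERS = {"C1": (0.0, 0.0), "C2": (4.0, 2.0)}
COV = ((9.0, 0.0), (0.0, 0.25))

# Hypothetical points: P1 lies on the long axis of cluster 1,
# P2 lies on the long axis of cluster 2.
POINTS = {"P1": (4.0, 0.2), "P2": (0.0, 1.8)}


def euclidean(p, c):
    """Plain Euclidean distance between point p and center c."""
    return math.hypot(p[0] - c[0], p[1] - c[1])


def mahalanobis(p, c, cov=COV):
    """Mahalanobis distance for a 2x2 covariance [[a, b], [b, d]]."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    dx, dy = p[0] - c[0], p[1] - c[1]
    # Quadratic form diff^T cov^{-1} diff, using the closed-form 2x2 inverse.
    q = (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.sqrt(q)


def nearest_center(p, dist):
    """Return the name of the center with minimal distance to p."""
    return min(CENTERS, key=lambda name: dist(p, CENTERS[name]))


euclid_labels = {name: nearest_center(p, euclidean) for name, p in POINTS.items()}
mahal_labels = {name: nearest_center(p, mahalanobis) for name, p in POINTS.items()}

print("Euclidean:  ", euclid_labels)
print("Mahalanobis:", mahal_labels)
```

With these assumed values, the Euclidean rule assigns P1 to C2 and P2 to C1, the same kind of error the figure illustrates, while the covariance-aware rule assigns P1 to C1 and P2 to C2.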