Graphics Reference
In-Depth Information
he most popular group of partitioning cluster algorithms are probably the cen-
troid-based algorithms, like K-means (MacQueen, ; Hartigan and Wong, )
orpartitioningaroundmedoids(PAM,KaufmanandRousseeuw, ).Assumethat
we are given a data set X N
=
x ,..., x N
and a set of centroids C K
=
c ,..., c K
.
Let d
(
x , y
)
again denote the distance between two points x and y ,let
c
(
x
)=
argmin
c C K
d
(
x , c
)
( . )
denote the centroid closest to x ,andlet
A k
=
x n
c
(
x n
)=
c k
( . )
be the set of all points where c k is the closest centroid. For simplicity of notation we
remove all empty clusters such that
,...,K.Mostclusteralgorithms
will try to find a set of centroids C K for fixed K such that the average distance
A k
,
k
=
N
n =
N
D
(
X N , C K
)=
d
(
x n , c
(
x n
))
min
C K
,
( . )
of each point to the closest centroid is minimized. However, for the following visu-
alization techniques it is not important whether such an optimum has actually been
reached,andthe choice ofthe centroid-based clusteralgorithm isalso not important.
Convex Cluster Hulls
11.3.1
Once a suitable set of centroids has been found, it is usually of interest to explore
how the centroids partition the input space, either in terms of the original data set
or in orderto enable predictions for new data. For visualization, one usually projects
thedata intotwo dimensions, usingprincipalcomponent analysis forexample. Pison
et al. ( ) use spanning ellipses to visually mark the area each cluster occupies in
aPCAplot.
When clustering nonGaussian data and/or using distances other than the Eu-
clidean distance, ellipses can be a misleading representation of cluster regions, be-
cause clusters may have arbitrary convex shapes, where the term convex relates to
the distance measure used.Ifclustersare projected into two-dimensional space,then
bagplots (Rousseeuw et al., ) can be used as a nonparametric alternative to el-
lipses.
For data partitioned using a centroid-based cluster algorithm, another natural
choice is to use the distance d
of each point from its respective cluster cen-
troid to define the inner and outer cluster areas. Let
(
x , c
(
x
))
m k
=
median
d
(
x n , c k
)
x n
A k
( . )
be the median distance of all points in cluster k to c k . We defines the inner area of
a clusterbythe convex hull of all data points where d
m k ;thiscorresponds
to the box in a boxplot. he outer area of a cluster is defined as the convex hull of
(
x n , c k
)
Search WWH ::




Custom Search