upper to the bottom part of the tree). An agglomerative algorithm begins with
n subclusters, each of which contains a single data point, and then merges the two
most similar groups at each stage to form a new cluster, thus reducing the number of
clusters by one. The algorithm proceeds until all the data fall within a single cluster. A
divisive algorithm starts with a single cluster composed of all the given objects and
keeps splitting the clusters based on some criterion, continuing until a partition of
n singleton clusters is obtained. There is a numerical value associated with each
position in the tree where branches join (the distance or dissimilarity between two
merged clusters).
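As a concrete illustration, the following Python sketch builds an agglomerative hierarchy and then cuts it into a flat partition. The data, the "average" linkage choice, and the number of clusters are made up for illustration; SciPy's scipy.cluster.hierarchy module is assumed to be available.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: n = 6 points in two dimensions.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [5.2, 5.1]])

# Agglomerative step: start from n singleton clusters and repeatedly
# merge the two most similar groups; "average" linkage is one possible
# criterion for the similarity between groups.
Z = linkage(X, method="average", metric="euclidean")

# Each row of Z records one merge together with the dissimilarity at
# which the two clusters were joined, i.e. the numerical value
# attached to that branch point of the tree.
print(Z)

# Sectioning the tree yields a flat partition, here into two clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 1 2 2 2]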
A clustering of the data items into disjoint groups is obtained by sectioning the
partition tree at a desired level. For instance, the dashed line in Fig. 1.5 at the
fourth level of merging yields a clustering of three groups. This defines a data
partitioning into a number of clusters of comparable homogeneity. The procedures
used to find the partitioning that best fits the underlying data are called cluster
validation techniques [39, 40]. In general, there are three types of criteria for
evaluating the resulting clustering structure: external criteria (a pre-specified
data structure that reflects our intuition about the clustering structure); internal
criteria (quantities estimated from the clusters); and relative criteria (comparing
with other structures resulting from the same algorithm but with different
parameter values). There are several indices proposed to implement the cluster
validity techniques; they are based on measuring the compactness and separation
of the clusters. Some examples of cluster validity indices are: the cophenetic
correlation coefficient, for comparing the proximity matrix (pairwise data distances)
of the resulting clustering with a known independent partition; the ratio of cluster
data dispersion to pairwise cluster dissimilarity; the root-mean-square standard
deviation of newly merged clusters; and the average scattering for clusters [40].
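For instance, a common form of the cophenetic correlation coefficient correlates the cophenetic distances implied by the dendrogram with the original pairwise data distances. A minimal sketch, assuming SciPy and reusing made-up data of the same form as the earlier example:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, cophenet

# Hypothetical toy data.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [5.2, 5.1]])

Y = pdist(X)                       # pairwise data distances
Z = linkage(X, method="average")   # the hierarchy under evaluation

# Correlation between the original pairwise distances and the
# cophenetic distances of the dendrogram; values near 1 indicate
# that the hierarchy preserves the proximity structure well.
c, coph_dists = cophenet(Z, Y)
print(c)
```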
Specifically, we have used the partition coefficient (PC) and partition entropy (PE)
indices from fuzzy clustering [41] to assess the partitioning obtained by the
proposed hierarchical clustering algorithm of Chap. 4.
$$\mathrm{PC} = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{nc} u_{ij}^{2} \qquad (1.9)$$

$$\mathrm{PE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{nc} u_{ij}\,\log_a(u_{ij}) \qquad (1.10)$$
where $u_{ij}$ denotes the degree of membership (posterior probability) of the data
point $x_i$ in the $j$th cluster. The PC index values range from $1/nc$ to 1, where
$nc$ is the number of clusters. The closer the index is to unity, the "crisper" the
clustering is. A PC value close to $1/nc$ indicates that there is no clustering
tendency. The PE index values range from 0 to $\log_a(nc)$. The closer the value of
PE is to 0, the harder (crisper) the clustering is. As with the PC index, values of
PE close to the upper bound $\log_a(nc)$ indicate the absence of any clustering
structure in the dataset or the inability of the algorithm to extract it.
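A minimal Python sketch of Eqs. (1.9) and (1.10), with NumPy assumed; the membership matrix U below is a made-up example whose rows sum to one:

```python
import numpy as np

def partition_coefficient(U):
    """Eq. (1.9): PC = (1/N) * sum_ij u_ij**2."""
    return np.sum(U ** 2) / U.shape[0]

def partition_entropy(U, a=np.e):
    """Eq. (1.10): PE = -(1/N) * sum_ij u_ij * log_a(u_ij).

    Terms with u_ij = 0 contribute nothing (u * log u -> 0 as u -> 0).
    """
    safe = np.where(U > 0, U, 1.0)   # log(1) = 0, so zero entries drop out
    return -np.sum(U * np.log(safe) / np.log(a)) / U.shape[0]

# Hypothetical fuzzy membership matrix: N = 4 points, nc = 2 clusters.
U = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.2, 0.8],
              [0.1, 0.9]])

print(partition_coefficient(U))   # 0.75, between 1/nc = 0.5 and 1
print(partition_entropy(U))       # ~0.41, below the bound ln(2) ~ 0.69
```

With natural logarithms (a = e), a PE near 0 would signal an almost crisp partition, while values approaching ln(nc) would suggest no clustering structure.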