upper to the bottom part of the tree). An agglomerative algorithm begins with
n subclusters, each of which contains a single data point, and then merges the two
most similar groups at each stage to form a new cluster, thus reducing the number of
clusters by one. The algorithm proceeds until all the data fall within a single cluster. A
divisive algorithm starts with a single cluster composed of all the given objects and
keeps splitting clusters according to some criterion, continuing until a partition into
n singleton clusters is obtained. There is a numerical value associated with each
position in the tree where branches join (the distance or dissimilarity between two
merged clusters).
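The agglomerative (bottom-up) procedure described above can be sketched in a few lines. The following is an illustrative toy implementation, not the chapter's method: it uses single-linkage merging on one-dimensional points, and the function name and data are hypothetical.

```python
# Minimal agglomerative (bottom-up) clustering sketch -- illustrative only,
# not the chapter's implementation. Single-linkage merging on 1-D points.

def agglomerative(points, target_clusters=1):
    """Start with n singleton clusters; repeatedly merge the two most
    similar (closest) clusters until `target_clusters` remain."""
    clusters = [[p] for p in points]          # n singleton clusters
    while len(clusters) > target_clusters:
        # Find the pair of clusters with the smallest inter-cluster
        # distance (single linkage: minimum pairwise point distance).
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(abs(x - y) for x in clusters[a] for y in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge cluster b into cluster a, reducing the count by one.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Stopping at three clusters corresponds to sectioning the tree at that level:
print(agglomerative([1.0, 1.2, 5.0, 5.1, 9.0], target_clusters=3))
# -> [[1.0, 1.2], [5.0, 5.1], [9.0]]
```

Running the merge loop to completion (`target_clusters=1`) reproduces the full dendrogram traversal; stopping early corresponds to cutting the tree at a desired level, as discussed next.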
A clustering of the data items into disjoint groups is obtained by sectioning the
partition tree at a desired level. For instance, the dashed line in Fig. 1.5 at the
fourth level of merging yields a clustering of three groups. This defines a data
partitioning into a number of clusters of comparable homogeneity. The procedures
used to find the partitioning that best fits the underlying data are called cluster
validation techniques [39, 40]. In general, there are three types of criteria for
evaluating the resulting clustering structure: external criteria (a pre-specified
data structure that reflects our intuition about the clustering structure); internal
criteria (quantities estimated from the clusters); and relative criteria (comparing
with other structures resulting from the same algorithm but with different
parameter values). Several indices have been proposed to implement the cluster
validity techniques; they are based on measuring the compactness and separation
of the clusters. Some examples of cluster validity indices are: the cophenetic
correlation coefficient, which compares the proximity matrix (pairwise data distances) of
the resulting clustering with a known independent partition; the ratio of cluster data
dispersion to a pairwise cluster dissimilarity measure; the root-mean-square standard
deviation of newly merged clusters; and the average scattering for clusters [40]. Specifically,
we have used the partition coefficient (PC) and partition entropy (PE) from
fuzzy clustering [41] to assess the partitioning obtained by the proposed hierarchical clustering:
$$\mathrm{PC} = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{nc} u_{ij}^{2} \qquad (1.9)$$

$$\mathrm{PE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{nc} u_{ij}\,\log_a(u_{ij}) \qquad (1.10)$$
where $u_{ij}$ denotes the degree of membership (posterior probability) of the data point $x_i$ in
the $j$-th cluster. The PC index values range from $1/nc$ to 1, where $nc$ is the number of
clusters. The closer the index is to unity, the "crisper" the clustering is. A PC
value close to $1/nc$ indicates that there is no clustering tendency. The PE index
values range from 0 to $\log_a(nc)$. The closer the value of PE is to 0, the harder the
clustering is. As for the PC index, values of PE close to the upper bound
$(\log_a(nc))$ indicate the absence of any clustering structure in the dataset or the
inability of the algorithm to extract it.
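Equations (1.9) and (1.10) are straightforward to compute from a membership matrix. The sketch below uses Bezdek's standard definitions of PC and PE as given above; the function names and the two example matrices are made up for illustration.

```python
import math

# Partition coefficient (PC) and partition entropy (PE), Eqs. (1.9)-(1.10).
# `U` is an N x nc membership matrix: U[i][j] is the degree of membership
# of data point x_i in cluster j (each row sums to 1).

def partition_coefficient(U):
    N = len(U)
    return sum(u ** 2 for row in U for u in row) / N

def partition_entropy(U, a=math.e):
    N = len(U)
    # Terms with u = 0 contribute 0 (lim u*log u = 0), so skip them.
    return -sum(u * math.log(u, a) for row in U for u in row if u > 0) / N

# A crisp partition: PC = 1, PE = 0 (strong clustering structure).
crisp = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
# A uniform partition: PC = 1/nc, PE = log_a(nc) (no clustering tendency).
uniform = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]

print(partition_coefficient(crisp), partition_entropy(crisp))      # 1.0 0.0
print(partition_coefficient(uniform), partition_entropy(uniform))  # 0.5 ~0.693
```

The two extreme matrices confirm the ranges stated above: the crisp partition attains PC's upper bound of 1 and PE's lower bound of 0, while the uniform partition hits PC = 1/nc and PE = log_a(nc).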