16.2
Cluster Validity Indices
There are three kinds of cluster validity indices VI, namely external, internal,
and relative indices [16]: (1) External indices evaluate a clustering solution by
comparing it to an a priori specified structure that reflects the desired result over
the dataset (e.g., the F-score measure, entropy, the Jaccard coefficient, the Rand
statistic). (2) Internal indices assess how well the imposed solution fits the
intrinsic structure of the data, based on quantities and features extracted from the
dataset itself (e.g., CPCC, the Hubert τ statistic). (3) Relative indices compare a
clustering solution to another one obtained with different parameters, which can
help in choosing the parameters that best fit the dataset. Relative indices tend to
favor solutions that maximize intra-cluster compactness and inter-cluster
separation (e.g., the DB and Dunn indices, C1..C4, S_Dbw).
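As an illustration of the external indices mentioned above, the following minimal sketch computes the Rand statistic and the Jaccard coefficient by pair counting, using their standard textbook definitions. The function names and the toy label lists are assumptions made for this example; they do not come from the cited works.

```python
from itertools import combinations


def pair_counts(labels_pred, labels_true):
    """Classify every pair of points by how the two partitions treat it.

    a: same cluster in both partitions
    b: same cluster in the prediction only
    c: same cluster in the reference only
    d: different clusters in both partitions
    """
    if len(labels_pred) != len(labels_true):
        raise ValueError("both labelings must cover the same points")
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_pred = labels_pred[i] == labels_pred[j]
        same_true = labels_true[i] == labels_true[j]
        if same_pred and same_true:
            a += 1
        elif same_pred:
            b += 1
        elif same_true:
            c += 1
        else:
            d += 1
    return a, b, c, d


def rand_statistic(labels_pred, labels_true):
    """Fraction of point pairs on which the two partitions agree."""
    a, b, c, d = pair_counts(labels_pred, labels_true)
    return (a + d) / (a + b + c + d)


def jaccard_coefficient(labels_pred, labels_true):
    """Like the Rand statistic, but ignoring pairs separated in both partitions."""
    a, b, c, _ = pair_counts(labels_pred, labels_true)
    return a / (a + b + c) if (a + b + c) else 1.0


# Hypothetical toy data: a reference structure and a clustering to evaluate.
reference = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, 1, 1, 1, 1]
print(rand_statistic(predicted, reference))
print(jaccard_coefficient(predicted, reference))
```

Both values lie in [0, 1], and both equal 1 when the clustering reproduces the reference structure exactly.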
Since our concern is broadly to find the optimal solution across different k
values, our focus in the following will be on relative validity indices. We can
distinguish two categories, depending on whether or not they scale with
the number of clusters [12]:
16.2.1
Relative Indices Scaling with the Number of Clusters
Some relative validity indices systematically follow the trend of the number of
clusters k, which means that, as k increases, their values keep either increasing
or decreasing. Thus, the optimal k cannot be defined by the maximum/minimum
value of a VI. It is usually chosen by inspecting the plot of VI against k and
taking the point with the most significant local change (jump or drop) in the
values of VI, which appears as a “knee” or an “elbow”. The intuition is that quick
jumps/drops are expected while k is still below the optimal value, and slower
jumps/drops are expected once the optimal k has been reached. However, given
the many variations in the values of VI, it is often difficult and unclear in
practice how to identify the right “knee” in the curve. To overcome this
shortcoming, two approaches are widely used: the gap statistic [36] and the
stability approach [2, 20].
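The visual "knee" search described above can be approximated numerically. The sketch below uses a simple second-difference heuristic: it picks the k at which the change in VI slows down (or speeds up) the most. This is an assumed illustration only; it is neither the gap statistic nor the stability approach, and the function name and the sample values are hypothetical.

```python
def knee_by_second_difference(ks, values):
    """Return the k at which the VI curve bends most sharply.

    The bend at ks[i] is measured by the absolute second difference
    |(v[i+1] - v[i]) - (v[i] - v[i-1])|, which is large where a quick
    jump/drop is followed by a slow one. Assumes evenly spaced k values;
    the first and last k cannot be selected because they lack a neighbor.
    """
    if len(ks) != len(values) or len(ks) < 3:
        raise ValueError("need at least three (k, value) pairs")
    best_k, best_bend = None, -1.0
    for i in range(1, len(ks) - 1):
        bend = abs((values[i + 1] - values[i]) - (values[i] - values[i - 1]))
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Hypothetical index values that keep decreasing as k grows.
ks = [1, 2, 3, 4, 5, 6]
vi = [100.0, 60.0, 25.0, 20.0, 17.0, 15.0]
print(knee_by_second_difference(ks, vi))  # prints 3
```

On noisy curves this heuristic inherits the ambiguity noted above, since several points may show bends of similar size.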
Among the indices in this category, we can find CH [35], Diff [19], and the Hubert
τ statistic [31]. Another set of indices (i.e., I1, I2, E1, H1, H2) was developed by
Zhao [38] specifically for document clustering purposes.
16.2.2
Relative Indices not Scaling with the Number of Clusters
Indices under this category do not systematically follow the trend of k. In this
case, the optimal k is more easily chosen as the point on the plot maximizing or
minimizing VI. Among the indices developed for generic clustering purposes,
we can cite: Dunn [9], the modified Dunn (m-dunn) [3], Davies-Bouldin (DB)
[7], RMSSDT, SPR, RS, CD [33], SD, S_Dbw [12], and SF [29]. Another set
of indices (i.e., C1, C2, C3, C4) was developed by Raskutti [27] for document
clustering purposes.
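For indices of this kind, choosing k reduces to an argmin or argmax over the candidate solutions. The sketch below illustrates this with the Davies-Bouldin index in its common formulation (mean distance to the centroid as cluster scatter, Euclidean distance between centroids), where lower values indicate a better solution. The helper names, the toy points, and the hand-written labelings are assumptions for this example; in practice the labelings would come from a clustering algorithm run once per k.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def davies_bouldin(points, labels):
    """Davies-Bouldin index of a labeling (lower is better).

    Assumes at least two clusters and distinct cluster centroids.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")
    cents = {lab: centroid(pts) for lab, pts in clusters.items()}
    # Scatter: mean distance of a cluster's points to its centroid.
    scatter = {
        lab: sum(math.dist(p, cents[lab]) for p in pts) / len(pts)
        for lab, pts in clusters.items()
    }
    total = 0.0
    for i in clusters:
        # Worst-case similarity of cluster i to any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


def best_k(points, labelings, index=davies_bouldin, minimize=True):
    """Score each candidate labeling and return (best k, all scores)."""
    scores = {k: index(points, labels) for k, labels in labelings.items()}
    pick = min if minimize else max
    return pick(scores, key=scores.get), scores


# Hypothetical data: three well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
labelings = {
    2: [0, 0, 0, 0, 0, 0, 1, 1, 1],  # first two groups merged
    3: [0, 0, 0, 1, 1, 1, 2, 2, 2],  # one cluster per group
    4: [0, 0, 3, 1, 1, 1, 2, 2, 2],  # first group split in two
}
k, scores = best_k(points, labelings)
print(k)  # prints 3
```

For an index that should be maximized, such as Dunn, the same selection applies with `minimize=False` and a corresponding index function.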