Exploring Validity Indices for Clustering Textual Data - Mining Complex Data


16.2 Cluster Validity Indices

There exist three kinds of cluster validity indices (VI), namely internal, external,

and relative indices [16]: (1) External indices evaluate a clustering solution by

comparing it to an a-priori specified structure that reflects the desired result over

the dataset (e.g., FScore measure, entropy, Jaccard coefficient, Rand statistic).

(2) Internal indices assess the intrinsic adequacy between the data structure

and the imposed solution, based on quantities and features extracted from the

dataset itself (e.g., CPCC, Hubert τ statistic). (3) Relative indices compare a

clustering solution to another one obtained with different parameters. This can

help choose the parameters that best fit the dataset. Relative indices tend to

favor solutions that maximize intra-cluster compactness and inter-cluster

separation (e.g., DB, Dunn indices, C1..C4, S_Dbw).
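To make the idea of an external index concrete, here is a minimal sketch of the Rand statistic and the Jaccard coefficient in their usual pair-counting form. The function names and the toy label vectors are assumptions made for illustration; they do not come from the chapter.

```python
from itertools import combinations


def pair_counts(labels_ref, labels_found):
    """Classify every pair of objects by how the two partitions treat it."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_ref)), 2):
        same_ref = labels_ref[i] == labels_ref[j]
        same_found = labels_found[i] == labels_found[j]
        if same_ref and same_found:
            a += 1  # grouped together in both partitions
        elif same_ref:
            b += 1  # together in the reference only
        elif same_found:
            c += 1  # together in the clustering only
        else:
            d += 1  # separated in both partitions
    return a, b, c, d


def rand_statistic(labels_ref, labels_found):
    a, b, c, d = pair_counts(labels_ref, labels_found)
    return (a + d) / (a + b + c + d)


def jaccard_coefficient(labels_ref, labels_found):
    # Undefined (division by zero) when no pair is grouped in either partition.
    a, b, c, _ = pair_counts(labels_ref, labels_found)
    return a / (a + b + c)


# Hypothetical example: a reference structure and a clustering of six objects.
reference = [0, 0, 0, 1, 1, 1]
clustering = [0, 0, 1, 1, 1, 1]
rand_value = rand_statistic(reference, clustering)
jaccard_value = jaccard_coefficient(reference, clustering)
```

Both values lie in [0, 1] and equal 1 when the clustering reproduces the reference structure exactly. The Jaccard coefficient ignores pairs separated in both partitions, so it is usually the stricter of the two.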

Since our concern is broadly to find the optimal solution across different k

values, our focus in the following will be on relative validity indices. We can

distinguish two categories of them, depending on whether or not they scale with

the number of clusters [12]:

16.2.1 Relative Indices Scaling with the Number of Clusters

Some relative validity indices systematically follow the trend

of the number of clusters k, which means that, as k increases, their values will

keep either increasing or decreasing. Thus, the optimal k cannot be identified

from the maximum/minimum value of VI. It is usually chosen by inspection,

taking the point on the plot with the most significant local change (jump or drop) in the

values of VI, which appears as a “knee” or an “elbow”. The intuition is that quick

jumps/drops are expected before the optimal k is reached, and slower

jumps/drops are expected once it has been reached. However, given the many

variations in the values of VI, it is often difficult in practice

to identify the right “knee” in the curve. To overcome this shortcoming, two

approaches are widely used: the gap statistic [36] and the stability approach

[2, 20].
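The "knee" inspection described above can be roughly automated. The sketch below uses a simple heuristic: it picks the k at which the discrete second difference of the VI curve is largest in magnitude. This is an assumption made for illustration only. It is neither the gap statistic nor the stability approach, and the function name and VI values are made up.

```python
def knee_by_second_difference(ks, values):
    """Return the k where the VI curve bends most sharply.

    Assumes the k values are consecutive and evenly spaced, and that at
    least three (k, VI) points are available.
    """
    if len(ks) != len(values) or len(ks) < 3:
        raise ValueError("need at least three (k, VI) points of equal length")
    best_k, best_bend = None, -1.0
    # End points have no neighbour on one side, so only interior k qualify.
    for i in range(1, len(ks) - 1):
        bend = abs(values[i - 1] - 2 * values[i] + values[i + 1])
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Hypothetical index values that decrease steadily with k.
ks = [2, 3, 4, 5, 6, 7]
vi = [100.0, 60.0, 30.0, 25.0, 22.0, 20.0]
knee = knee_by_second_difference(ks, vi)  # 4 for these made-up values
```

On noisy curves this heuristic can latch onto a spurious local change, which is exactly the difficulty that motivates the more principled approaches cited above.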

Indices in this category include CH [35], Diff [19], and the Hubert

τ statistic [31]. Another set of indices (i.e., I1, I2, E1, H1, H2) was developed by

Zhao [38] specifically for document clustering purposes.

16.2.2 Relative Indices not Scaling with the Number of Clusters

Indices under this category do not systematically follow the trend of k. In this

case, the optimal k is more easily chosen as the point on the plot maximizing or

minimizing VI. Among indices developed for generic clustering purposes,

we can cite: Dunn [9], the modified Dunn (m-dunn) [3], Davies-Bouldin (DB)

[7], RMSSDT, SPR, RS, CD [33], SD, S_Dbw [12], and SF [29]. Another set

of indices (i.e., C1, C2, C3, C4) was developed by Raskutti [27] for document

clustering purposes.
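As an illustration of choosing k at the extremum of an index, the following sketch computes the Davies-Bouldin index in its common centroid-based form with Euclidean distance (lower is better) and keeps the candidate k with the smallest value. The toy points and the hand-written candidate labelings are hypothetical; in practice each labeling would come from running a clustering algorithm with a given k.

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroid(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def davies_bouldin(points, labels):
    """Davies-Bouldin index of a partition; smaller values are better."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")
    cents = {lab: centroid(pts) for lab, pts in clusters.items()}
    # Scatter: mean distance of a cluster's members to its centroid.
    scatter = {
        lab: sum(dist(p, cents[lab]) for p in pts) / len(pts)
        for lab, pts in clusters.items()
    }
    total = 0.0
    for i in clusters:
        # Worst-case similarity of cluster i to any other cluster.
        # Raises ZeroDivisionError if two centroids coincide.
        total += max(
            (scatter[i] + scatter[j]) / dist(cents[i], cents[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


# Hypothetical data: three well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
# Hypothetical candidate solutions for k = 2, 3, 4.
candidates = {
    2: [0, 0, 0, 1, 1, 1, 1, 1, 1],
    3: [0, 0, 0, 1, 1, 1, 2, 2, 2],
    4: [0, 0, 3, 1, 1, 1, 2, 2, 2],
}
scores = {k: davies_bouldin(points, labels) for k, labels in candidates.items()}
best_k = min(scores, key=scores.get)
```

For an index where higher values are better, such as Dunn, the last line would use `max` instead of `min`.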
