Exploring Validity Indices for Clustering Textual Data - Mining Complex Data

Information Technology Reference

In-Depth Information

Due to their facility of interpretation, we choose to focus on indices not scaling

with k . The other reason is related to our primary goal which is to enable the

usage of relative indices as stopping criteria. This is definitely a hard task that

will get much harder if the optimal k must be selected using relatively sophis-

ticated techniques like the “knee” detection, the gap statistics, or the stability

approach.

A known drawback when involving validity indices in clustering is the com-

putational cost that quickly becomes prohibitive when scaling to large and high-

dimensional datasets. The main reason is that pairwise similarities between the

dataset elements/clusters have to be calculated. As a preliminary attempt to

reduce complexity, we propose a new validity index, H 3, that we define as

follows 1 .

H 3= i =1 n i . n i

j =1 sim ( e j ,S i )

( i =1 sim ( S i ,S )) /k

where sim denotes the similarity between two objects, S i denotes the centroid

of cluster C i containing n i elements, e j denotes a data element, and S denotes

the collection's centroid which is the average vector of all clusters' centroids.

H 3 is significantly less expensive than other indices. The reason is that H 3

deals with centroids to calculate the inter-cluster separation and the intra-cluster

compactness. It has a linear complexity of O ( k + n ). H 3 is inspired from the H 1

and H 2 indices proposed by Zhao [38]. The difference is that H 3 does not follow

the trend of k after having removed its sensitivity to k in an ad-hoc manner.

As a matter of fact, the intra-cluster similarity decreases as k decreases, thus

the quality of clustering continuously deteriorates from an intra-cluster point

of view. We consider that an optimal partition is reached, when the average of

inter-cluster similarities, that tends to improve while grouping similar objects,

is no more able to overwhelm the intra-cluster deterioration.

16.3

A Method for Enhancing Relative Indices Usage as

Stopping Criteria

In this section, we propose a method that aims to reduce the complexity of

clustering algorithms when involving relative indices as criterion functions. One

potential option is to use these indices to terminate the clustering process at a

point where the “optimal” solution is reached, which allows to discard all the

remaining unnecessary part of the process.

16.3.1

Problem Definition

The classical usage of relative validity indices for determining the k yielding

the optimal clustering solution comes a-posteriori , after evaluating the different

1 Evaluation of the H 3 index is out of scope in this chapter.

Search WWH ::

Custom Search

Home