Exploring Validity Indices for Clustering Textual Data - Mining Complex Data

Information Technology Reference

In-Depth Information

Fig. 16.1. A validity index values obtained along different number of clusters

solutions provided by a clustering algorithm through all the possible k values

[24, 27, 12, 39]. Consider in Figure 16.1 the plot (that we call index )ofava-

lidity index against the different k values. Using an agglomerative algorithm,

once reaching the flat optimal solution at k = a , all the remaining actions (until

k = 1) are obviously a time waste because we will end up by considering the

solution provided at a . Hence, finding a relevant stopping criterion is primordial.

In probabilistic clustering algorithms (mixture models), many stopping criteria

are defined quantifying the degree to which a model fits a dataset; among the

most known criteria, we can find the Bayesian Information Criterion (BIC) [10],

and the Minimum Description Length (MDL) criterion [28]. In non-probabilistic

algorithms - which are our concern in this chapter - stopping criteria rely in

most cases on input user parameters. For instance, in agglomerative algorithms,

these parameters can be a predefined number of clusters, a minimum similar-

ity between clusters, a maximum similarity gap between successive levels, etc.

This kind of stopping criteria has serious limitations since users often ignore the

parameters that best fit the datasets [13].

A promising approach to address this issue is to make use of relative indices

in order to develop an incremental agglomerative algorithm [8] able to stop

once reaching the “right” optimal solution in terms of a VI at k = a .Thus,

an intuitive approach is to let the clustering process go on while optimizing a

VI in a stepwise fashion, and to stop once reaching a point ( k = b ) where no

further (significant) improvement 2 can be made with any (merging) action [25].

However, such an ad-hoc approach suffers from ignoring, at the specific level

b , whether it has truly reached the optimal solution (i.e., a = b ), or a better

solution will come afterward if it accepts a quality decrease at b (which is the

case in Figure 16.1). The major problem is that validity indices are using too

much local information to take a global decision, e.g., stopping the process. As

one could notice, addressing the described issue is a tough and challenging task.

Along those lines, we developed a method that aims at enhancing agglomerative

algorithms with context-aware decisions taken along with validity indices.

2 While a significant improvement is required with indices scaling with k , a slight

improvement is enough for indices not scaling with k .

Search WWH ::

Custom Search

Home