Information Technology Reference
In-Depth Information
Fig. 16.1. A validity index values obtained along different number of clusters
solutions provided by a clustering algorithm through all the possible k values
[24, 27, 12, 39]. Consider in Figure 16.1 the plot (that we call index )ofava-
lidity index against the different k values. Using an agglomerative algorithm,
once reaching the flat optimal solution at k = a , all the remaining actions (until
k = 1) are obviously a time waste because we will end up by considering the
solution provided at a . Hence, finding a relevant stopping criterion is primordial.
In probabilistic clustering algorithms (mixture models), many stopping criteria
are defined quantifying the degree to which a model fits a dataset; among the
most known criteria, we can find the Bayesian Information Criterion (BIC) [10],
and the Minimum Description Length (MDL) criterion [28]. In non-probabilistic
algorithms - which are our concern in this chapter - stopping criteria rely in
most cases on input user parameters. For instance, in agglomerative algorithms,
these parameters can be a predefined number of clusters, a minimum similar-
ity between clusters, a maximum similarity gap between successive levels, etc.
This kind of stopping criteria has serious limitations since users often ignore the
parameters that best fit the datasets [13].
A promising approach to address this issue is to make use of relative indices
in order to develop an incremental agglomerative algorithm [8] able to stop
once reaching the “right” optimal solution in terms of a VI at k = a .Thus,
an intuitive approach is to let the clustering process go on while optimizing a
VI in a stepwise fashion, and to stop once reaching a point ( k = b ) where no
further (significant) improvement 2 can be made with any (merging) action [25].
However, such an ad-hoc approach suffers from ignoring, at the specific level
b , whether it has truly reached the optimal solution (i.e., a = b ), or a better
solution will come afterward if it accepts a quality decrease at b (which is the
case in Figure 16.1). The major problem is that validity indices are using too
much local information to take a global decision, e.g., stopping the process. As
one could notice, addressing the described issue is a tough and challenging task.
Along those lines, we developed a method that aims at enhancing agglomerative
algorithms with context-aware decisions taken along with validity indices.
2 While a significant improvement is required with indices scaling with k , a slight
improvement is enough for indices not scaling with k .
 
Search WWH ::




Custom Search