Information Technology Reference
In-Depth Information
Fig. 16.1.
A validity index values obtained along different number of clusters
solutions provided by a clustering algorithm through all the possible
k
values
[24, 27, 12, 39]. Consider in Figure 16.1 the plot (that we call
index
)ofava-
lidity index against the different
k
values. Using an agglomerative algorithm,
once reaching the flat optimal solution at
k
=
a
, all the remaining actions (until
k
= 1) are obviously a time waste because we will end up by considering the
solution provided at
a
. Hence, finding a relevant stopping criterion is primordial.
In probabilistic clustering algorithms (mixture models), many stopping criteria
are defined quantifying the degree to which a model fits a dataset; among the
most known criteria, we can find the Bayesian Information Criterion (BIC) [10],
and the Minimum Description Length (MDL) criterion [28]. In non-probabilistic
algorithms - which are our concern in this chapter - stopping criteria rely in
most cases on input user parameters. For instance, in agglomerative algorithms,
these parameters can be a predefined number of clusters, a minimum similar-
ity between clusters, a maximum similarity gap between successive levels, etc.
This kind of stopping criteria has serious limitations since users often ignore the
parameters that best fit the datasets [13].
A promising approach to address this issue is to make use of relative indices
in order to develop an incremental agglomerative algorithm [8] able to stop
once reaching the “right” optimal solution in terms of a
VI
at
k
=
a
.Thus,
an intuitive approach is to let the clustering process go on while optimizing a
VI
in a stepwise fashion, and to stop once reaching a point (
k
=
b
) where no
further (significant) improvement
2
can be made with any (merging) action [25].
However, such an ad-hoc approach suffers from ignoring, at the specific level
b
, whether it has truly reached the optimal solution (i.e.,
a
=
b
), or a better
solution will come afterward if it accepts a quality decrease at
b
(which is the
case in Figure 16.1). The major problem is that validity indices are using too
much local information to take a global decision, e.g., stopping the process. As
one could notice, addressing the described issue is a tough and challenging task.
Along those lines, we developed a method that aims at enhancing agglomerative
algorithms with context-aware decisions taken along with validity indices.
2
While a significant improvement is required with indices scaling with
k
, a slight
improvement is enough for indices not scaling with
k
.