Information Technology Reference
In-Depth Information
The goal of any clustering algorithm seeking flat partitions of data is to find
the optimal clustering solution and the optimal number of clusters k . One natural
way to reach this without the need for parameters, is to evaluate the quality of
different clustering solutions along different number of clusters, in order to finally
choose the solution giving the best results. In cluster analysis, the procedure of
evaluating the results is known as cluster validation [16], and the indices that
aims at comparing different solutions with different parameters are known as
relative validity indices [12].
Relative indices can be involved in clustering methods in two different ways:
-
They can be involved as external indicators about the quality of the clustering
solution(s) provided by an algorithm.
-
They can be involved as criterion functions driving the entire clustering pro-
cess. In this case, algorithms are often called incremental [8] .
Involving a validity index as a criterion function leads usually to a very high
complexity. Indeed, this complexity depends on each validity index, and on the
type of algorithm that tends to optimize it. For instance, in agglomerative al-
gorithms, at each level of the process, instead of merging two clusters by means
of a 'classic' criterion function (e.g., by average-linkage), a validity index VI
is calculated evaluating the different solutions that could be obtained following
the different merging possibilities. The pair of clusters that will be chosen for
merging is the one that optimizes VI . Thus, at a level having k clusters, the
complexity of evaluating the different solutions is O ( k ( k
1) / 2);
In the literature, an ambiguity remains unresolved about the usage of validity
indices. What is sure, is that they are highly effective, since they seem to yield
the only way to bypass the need for input parameters in most clustering algo-
rithms. However, despite their central role, the real utility of validity indices was
neither fully exploited nor enough investigated in the literature, especially when
dealing with high-dimensional data like text. Along those lines, we provide two
main contributions that broadly aim at exploring and exploiting relative indices
in clustering algorithms. The applications of interest are document and word
clustering. More particularly, our contributions try to bring answers to the key
questions below.
-
How “good” are relative indices at detecting the optimal clustering solution?
Are they better used as criterion functions or simply as external indicators?
In case of criterion functions, which index is most likely to drive an agglom-
erative algorithm to an 'optimal partition'?
-
How reliably can we use validity indices as stopping criteria (to terminate
the process) in agglomerative algorithm? To which extent can we enhance
their reliability for such usage?
The remainder of this chapter is organized as follows. After an overview on
cluster validity indices in the next section, we describe in Section 16.3 our new
context-aware method. Relative indices are evaluated and compared in Section
16.4. Then, we evaluate our context-aware method in Section 16.5. We conclude
in Section 16.6 by summarizing and drawing some future trends.
 
Search WWH ::




Custom Search