Exploring Validity Indices for Clustering Textual Data - Mining Complex Data

Information Technology Reference

In-Depth Information

The goal of any clustering algorithm seeking flat partitions of data is to find

the optimal clustering solution and the optimal number of clusters k . One natural

way to reach this without the need for parameters, is to evaluate the quality of

different clustering solutions along different number of clusters, in order to finally

choose the solution giving the best results. In cluster analysis, the procedure of

evaluating the results is known as cluster validation [16], and the indices that

aims at comparing different solutions with different parameters are known as

relative validity indices [12].

Relative indices can be involved in clustering methods in two different ways:

-

They can be involved as external indicators about the quality of the clustering

solution(s) provided by an algorithm.

-

They can be involved as criterion functions driving the entire clustering pro-

cess. In this case, algorithms are often called incremental [8] .

Involving a validity index as a criterion function leads usually to a very high

complexity. Indeed, this complexity depends on each validity index, and on the

type of algorithm that tends to optimize it. For instance, in agglomerative al-

gorithms, at each level of the process, instead of merging two clusters by means

of a 'classic' criterion function (e.g., by average-linkage), a validity index VI

is calculated evaluating the different solutions that could be obtained following

the different merging possibilities. The pair of clusters that will be chosen for

merging is the one that optimizes VI . Thus, at a level having k clusters, the

complexity of evaluating the different solutions is O ( k ( k

1) / 2);

In the literature, an ambiguity remains unresolved about the usage of validity

indices. What is sure, is that they are highly effective, since they seem to yield

the only way to bypass the need for input parameters in most clustering algo-

rithms. However, despite their central role, the real utility of validity indices was

neither fully exploited nor enough investigated in the literature, especially when

dealing with high-dimensional data like text. Along those lines, we provide two

main contributions that broadly aim at exploring and exploiting relative indices

in clustering algorithms. The applications of interest are document and word

clustering. More particularly, our contributions try to bring answers to the key

questions below.

−

-

How “good” are relative indices at detecting the optimal clustering solution?

Are they better used as criterion functions or simply as external indicators?

In case of criterion functions, which index is most likely to drive an agglom-

erative algorithm to an 'optimal partition'?

-

How reliably can we use validity indices as stopping criteria (to terminate

the process) in agglomerative algorithm? To which extent can we enhance

their reliability for such usage?

The remainder of this chapter is organized as follows. After an overview on

cluster validity indices in the next section, we describe in Section 16.3 our new

context-aware method. Relative indices are evaluated and compared in Section

16.4. Then, we evaluate our context-aware method in Section 16.5. We conclude

in Section 16.6 by summarizing and drawing some future trends.

Mining Complex Data

Search WWH ::

Custom Search

Home