Exploring Validity Indices for Clustering Textual Data - Mining Complex Data

16.5 Evaluating Our Context-Aware Method

In this section, we present an experimental study that attempts to answer the following question: “How reliable is the use of relative indices as stopping criteria in agglomerative clustering?” Along those lines, we explore the added value of enhancing a clustering process with context-awareness in order to enable the use of validity indices as stopping criteria. We evaluate the proposed method on the four benchmarks described in Sections 16.4.2 and 16.4.3, for document and word clustering respectively. We excluded from the following experiments some indices that proved inappropriate for the context-aware method because their curves are too unstable to be stabilized (e.g., Dunn, m-Dunn). In the end, our experiments are carried out on 5 indices, namely DB [7], C1, C2, C4 [27], and H3.

The experiments therefore cover 10 algorithm variants: the agglomerative algorithm is run twice (with and without context-awareness) for each of the 5 relative indices. Each solution produced at each level of the clustering process is evaluated by means of the target relative index (predicted quality) and the FScore (real quality).
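As an illustration of the "real quality" measure, the sketch below computes a clustering FScore from reference classes and cluster assignments. It assumes the common class-weighted definition (for each class, the best F-measure over all clusters, averaged with class-size weights); this section does not restate the chapter's exact formula, so the definition and the function name `fscore` are assumptions.

```python
from collections import Counter


def fscore(labels_true, labels_pred):
    """Class-weighted clustering FScore (assumed definition).

    For each reference class, take the best F-measure over all clusters,
    then average these values weighted by class size.
    """
    n = len(labels_true)
    class_sizes = Counter(labels_true)
    cluster_sizes = Counter(labels_pred)
    # Number of items shared by each (class, cluster) pair.
    overlap = Counter(zip(labels_true, labels_pred))

    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = overlap.get((cls, clu), 0)
            if n_both == 0:
                continue  # F-measure is 0 when there is no overlap
            precision = n_both / n_clu
            recall = n_both / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_cls / n) * best
    return total


# Made-up labels: a perfect clustering and a single all-in-one cluster.
perfect = fscore([0, 0, 1, 1], ["a", "a", "b", "b"])
merged = fscore([0, 0, 1, 1], ["a", "a", "a", "a"])
```

In an evaluation loop, such a function would be called once per level of the agglomerative process, alongside the relative index being tested.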

As stressed earlier in this chapter, the goal is to bring the solution provided before FD as close as possible to the optimal solution. The optimal solution is defined as the solution at the k where a specific VI reaches its maximum or minimum, depending on whether the VI is to be maximized or minimized.
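The definition of the optimal solution can be sketched in a few lines. The helper name `optimal_k` and the index values below are hypothetical and serve only to illustrate the maximize/minimize distinction.

```python
def optimal_k(vi_by_k, maximize):
    """Return the number of clusters k at which the validity index is best.

    vi_by_k:  dict mapping k to the index value obtained at that level.
    maximize: True if larger index values are better, False otherwise.
    """
    pick = max if maximize else min
    return pick(vi_by_k, key=vi_by_k.get)


# Made-up values for an index that is minimized (as is the case for DB).
example_curve = {5: 1.20, 4: 0.90, 3: 0.70, 2: 1.10}
best_k = optimal_k(example_curve, maximize=False)
```

When several levels share the best value, this sketch returns the first one encountered in the dictionary's iteration order; a different tie-breaking rule may be preferable.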

16.5.1 On Approaching the Optimal Clustering Solution

We first study to what extent the context-aware method allows FD to approach the optimal clustering solution reached under a specific number of clusters k. To this end, Figures 16.9 and 16.10 show the complete agglomerative clustering process (k = n → k = 1) divided into three parts:

- P1: This part goes from the initial set (k = n) to the last point before FD. Thus, using a VI as a stopping criterion will lead the process to the last point of P1.

- P2: This part goes from FD to the optimal clustering solution. It represents the part that must be processed but will not be if a VI is used as a stopping criterion.

- P3: This part goes from the optimal solution to the root cluster (k = 1). It is the unnecessary part that will be performed in vain if a VI is not used as a stopping criterion.
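The three parts can be made concrete with a short sketch. It assumes that FD denotes the first level at which the index value deteriorates relative to the previous level; this section does not restate the definition of FD, so that reading, the function name `split_process`, and the sample values are all assumptions. The optimal level is placed at the end of P2, which is one possible reading of the boundaries above.

```python
def split_process(curve, maximize):
    """Split an agglomerative VI curve into the parts P1, P2 and P3.

    curve:    list of (k, vi) pairs ordered from k = n down to k = 1.
    maximize: True if larger index values are better.
    Returns three lists of k values.
    """
    ks = [k for k, _ in curve]
    values = [v for _, v in curve]

    def worse(new, old):
        return new < old if maximize else new > old

    # ASSUMPTION: FD is the first level whose value is worse than the
    # previous level's value. If no such level exists, P1 spans everything.
    fd = next(
        (i for i in range(1, len(values)) if worse(values[i], values[i - 1])),
        len(ks),
    )

    # Position of the optimal solution (global maximum or minimum).
    pick = max if maximize else min
    opt = pick(range(len(values)), key=values.__getitem__)

    p1 = ks[:fd]                   # k = n up to the last point before FD
    p2 = ks[fd:opt + 1]            # FD up to and including the optimum
    p3 = ks[max(fd, opt + 1):]     # remaining merges down to k = 1
    return p1, p2, p3


# Made-up curve for an index that is minimized.
curve = [(6, 1.0), (5, 0.8), (4, 0.9), (3, 0.6), (2, 0.7), (1, 1.5)]
p1, p2, p3 = split_process(curve, maximize=False)
```

With these sample values, stopping at FD would end the process at k = 5, while the optimum lies at k = 3, so P2 contains the two levels that a stopping criterion would miss. A shorter P2 means that FD occurs closer to the optimal solution.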

Figures 16.9 and 16.10 make the added value of the context-aware method apparent for both applications, word and document clustering. On the one hand, it spares the clustering algorithm from processing all of the P3 parts, which would be a considerable waste of time. On the other hand, it helps reduce P2, since in most cases FD occurs remarkably closer to the optimal solution. This will enable us to consider more relevantly a solution before FD as the
