In this section, we investigate, through an experimental study, the utility of
different relative indices for an agglomerative algorithm in two different con-
texts: document and word clustering. In these contexts, many questions remain
unaddressed in the literature.
- How “good” are relative indices at evaluating partitions? Is it worthwhile to
involve relative indices as criterion functions in an agglomerative algorithm?
Or could involving them only as external indicators be enough (leading to com-
parable results at a much lower complexity)? The goal here is, of course, to find
the best trade-off between effectiveness and efficiency among the different
approaches.
- Consider the case where relative indices are involved as criterion functions.
A key question arises: Which index is most likely to guide the agglomerative
algorithm to the optimal clustering solution and to the optimal number of
clusters in each application?
Indices must indeed be evaluated according to their ability to identify both
the optimal clustering solution and the optimal number of clusters. Note that
these two goals do not necessarily coincide: since algorithms are error-prone,
an optimal solution may occur at a number of clusters different from the “real”
optimal number, and an algorithm may also produce poor solutions at the “real”
optimal number of clusters. For this reason, we have chosen to distinguish
between the two concepts.
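To make the external-indicator approach concrete, the following sketch runs a naive agglomerative algorithm and scores every merge level after the fact with a relative index. The data, the single-linkage merging rule, and the choice of the silhouette width as the relative index are all illustrative assumptions, not the setup used in this study:

```python
# Illustrative sketch: a relative index (silhouette width) used as an
# EXTERNAL indicator, i.e., scoring each level of an agglomerative
# clustering after the merges, rather than driving the merges themselves.
# Toy 1-D data standing in for a document-similarity space (hypothetical).
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]

def dist(a, b):
    return abs(a - b)

def single_link(c1, c2):
    """Single-linkage distance between two clusters of point indices."""
    return min(dist(points[i], points[j]) for i in c1 for j in c2)

def silhouette(clusters):
    """Mean silhouette width of a partition (a relative validity index)."""
    scores = []
    for c in clusters:
        for i in c:
            if len(c) == 1:
                scores.append(0.0)  # convention for singletons
                continue
            a = sum(dist(points[i], points[j]) for j in c if j != i) / (len(c) - 1)
            b = min(sum(dist(points[i], points[j]) for j in other) / len(other)
                    for other in clusters if other is not c)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Agglomerative loop: always merge the closest pair, score each level.
clusters = [[i] for i in range(len(points))]
best_k, best_score = None, -1.0
while len(clusters) > 2:
    i, j = min(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda p: single_link(clusters[p[0]], clusters[p[1]]))
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]
    s = silhouette(clusters)
    if s > best_score:
        best_k, best_score = len(clusters), s

print(best_k)  # the merge level at which the relative index peaks
```

The level maximizing the index is then reported as the estimated number of clusters; using the index as a criterion function would instead mean selecting, at each step, the merge that optimizes the index itself, at a higher computational cost.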
The rest of this section is organized as follows: we start by describing the two
benchmarks used for our experiments. Then, in the following two subsections,
we present our evaluation methodology and the results obtained for both bench-
marks. We conclude the section with a discussion of the results.
16.4.2 Benchmark for Document Clustering
The benchmark used for document clustering consists of two datasets, whose
general characteristics are summarized in Table 16.1. These are two distinct
collections (with no document in common) extracted from the Reuters corpus 6 .
Basically, the Reuters corpus contains over 800,000 manually categorized
newswire stories (documents), each consisting of a few hundred up to several
thousand words. Each document has been manually assigned to one or more
topics, such as “Economics, Markets, Corporate/Industrial” . For our experiments,
documents are preprocessed by applying the classical Natural Language
Processing (NLP) techniques provided by Gate 7 : tokenization, stop-word
removal, POS tagging, and word lemmatization. The Vector-Space Model is
used to represent each document d by a vector v in a multidimensional space,
where each dimension corresponds to a word weighted by its tf.idf score [30].
Finally, the similarity between a pair of documents is computed as the
cosine coefficient between their feature vectors.
6 Reuters corpus, volume 1 (RCV 1), English language, release date: 2000-11-03.
7 http://www.gate.ac.uk/
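The document representation described above can be sketched as follows. The tiny corpus of pre-tokenized documents is hypothetical, and the tf.idf weighting shown (raw term frequency times log inverse document frequency) is one common variant, not necessarily the exact formula used in the experiments:

```python
import math
from collections import Counter

# Hypothetical toy corpus: token lists standing in for Reuters documents
# after tokenization, stop-word removal, POS tagging and lemmatization.
docs = [
    ["market", "stock", "rise", "trade"],
    ["market", "economy", "growth", "trade"],
    ["football", "match", "goal", "team"],
]

def tf_idf_vector(doc, corpus):
    """Represent a document as a sparse vector {word: tf.idf weight}."""
    n = len(corpus)
    vec = {}
    for word, count in Counter(doc).items():
        df = sum(1 for d in corpus if word in d)   # document frequency
        vec[word] = count * math.log(n / df)       # tf * idf
    return vec

def cosine(u, v):
    """Cosine coefficient between two sparse feature vectors."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = [tf_idf_vector(d, docs) for d in docs]
print(cosine(vecs[0], vecs[1]))  # shared vocabulary -> positive similarity
print(cosine(vecs[0], vecs[2]))  # disjoint vocabulary -> zero similarity
```

Note that words occurring in every document receive an idf of zero and thus contribute nothing to the similarity, which is the intended effect of the idf component.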
 