In this section, we investigate, through an experimental study, the utility of
different relative indices for an agglomerative algorithm in two different con-
texts: document and word clustering. In these contexts, many questions remain
unaddressed in the literature.
- How “good” are relative indices at evaluating partitions? Is it worthwhile to
involve relative indices as criterion functions in an agglomerative algorithm?
Or could involving them only as external indicators be enough (leading to com-
parable results at a much lower complexity)? The goal here is, of course, to find
the best trade-off between effectiveness and efficiency among the different
approaches.
- Consider the case where relative indices are involved as criterion functions.
A key question arises: Which index is most likely to guide the agglomerative
algorithm to the optimal clustering solution and to the optimal number of
clusters in each application?
Indices must indeed be evaluated according to their ability to identify both
the optimal clustering solution and the optimal number of clusters. Note that
these two goals do not necessarily coincide: since algorithms are error-prone,
an optimal solution may occur at a number of clusters different from the “real”
optimal number, and an algorithm may also produce poor solutions at the “real”
optimal number of clusters. For this reason, we have chosen to distinguish
between the two concepts.
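To make the external-indicator approach concrete, the following sketch runs a naive agglomerative algorithm and scores every merge level after the fact with a relative index. The data, the single-linkage merging rule, and the choice of the silhouette width as the relative index are all illustrative assumptions, not the setup used in this study:

```python
# Illustrative sketch: a relative index (silhouette width) used as an
# EXTERNAL indicator, i.e., scoring each level of an agglomerative
# clustering after the merges, rather than driving the merges themselves.
# Toy 1-D data standing in for a document-similarity space (hypothetical).
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]

def dist(a, b):
    return abs(a - b)

def single_link(c1, c2):
    """Single-linkage distance between two clusters of point indices."""
    return min(dist(points[i], points[j]) for i in c1 for j in c2)

def silhouette(clusters):
    """Mean silhouette width of a partition (a relative validity index)."""
    scores = []
    for c in clusters:
        for i in c:
            if len(c) == 1:
                scores.append(0.0)  # convention for singletons
                continue
            a = sum(dist(points[i], points[j]) for j in c if j != i) / (len(c) - 1)
            b = min(sum(dist(points[i], points[j]) for j in other) / len(other)
                    for other in clusters if other is not c)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Agglomerative loop: always merge the closest pair, score each level.
clusters = [[i] for i in range(len(points))]
best_k, best_score = None, -1.0
while len(clusters) > 2:
    i, j = min(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda p: single_link(clusters[p[0]], clusters[p[1]]))
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]
    s = silhouette(clusters)
    if s > best_score:
        best_k, best_score = len(clusters), s

print(best_k)  # the merge level at which the relative index peaks
```

The level maximizing the index is then reported as the estimated number of clusters; using the index as a criterion function would instead mean selecting, at each step, the merge that optimizes the index itself, at a higher computational cost.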
The rest of this section is organized as follows: we start by describing the two
benchmarks used for our experiments. Then, in the following two subsections,
we present our evaluation methodology and the results obtained for both bench-
marks. We conclude the section with a discussion of the results.
16.4.2 Benchmark for Document Clustering
The benchmark used for document clustering consists of two datasets, whose
general characteristics are summarized in Table 16.1. These are two distinct
collections (with no document in common) extracted from the Reuters corpus 6 .
Basically, the Reuters corpus contains over 800,000 manually categorized
newswire stories (documents), each consisting of a few hundred up to several
thousand words. Each document has been manually assigned to one or more
topics, such as “Economics, Markets, Corporate/Industrial” . For our experiments,
documents are preprocessed by applying the classical Natural Language
Processing (NLP) techniques provided by Gate 7 : tokenization, stop-word
removal, POS tagging, and word lemmatization. The Vector-Space Model is
used to represent each document d by a vector v in a multidimensional space,
where each dimension corresponds to a word weighted by its tf.idf score [30].
Finally, the similarity between a pair of documents is computed as the
cosine coefficient between their feature vectors.
6 Reuters corpus, volume 1 (RCV 1), English language, release date: 2000-11-03.
7 http://www.gate.ac.uk/
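The document representation described above can be sketched as follows. The tiny corpus of pre-tokenized documents is hypothetical, and the tf.idf weighting shown (raw term frequency times log inverse document frequency) is one common variant, not necessarily the exact formula used in the experiments:

```python
import math
from collections import Counter

# Hypothetical toy corpus: token lists standing in for Reuters documents
# after tokenization, stop-word removal, POS tagging and lemmatization.
docs = [
    ["market", "stock", "rise", "trade"],
    ["market", "economy", "growth", "trade"],
    ["football", "match", "goal", "team"],
]

def tf_idf_vector(doc, corpus):
    """Represent a document as a sparse vector {word: tf.idf weight}."""
    n = len(corpus)
    vec = {}
    for word, count in Counter(doc).items():
        df = sum(1 for d in corpus if word in d)   # document frequency
        vec[word] = count * math.log(n / df)       # tf * idf
    return vec

def cosine(u, v):
    """Cosine coefficient between two sparse feature vectors."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = [tf_idf_vector(d, docs) for d in docs]
print(cosine(vecs[0], vecs[1]))  # shared vocabulary -> positive similarity
print(cosine(vecs[0], vecs[2]))  # disjoint vocabulary -> zero similarity
```

Note that words occurring in every document receive an idf of zero and thus contribute nothing to the similarity, which is the intended effect of the idf component.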
 