16
Exploring Validity Indices
for Clustering Textual Data
Ahmad El Sayed, Hakim Hacid, and Djamel Zighed
University of Lyon
ERIC Laboratory - 5, avenue Pierre Mendès-France
69676 Bron cedex - France
{asayed, hhacid, dzighed}@eric.univ-lyon2.fr
Abstract. The goal of any clustering algorithm that produces flat partitions of data is
to find both the optimal clustering solution and the optimal number of clusters. One
natural way to reach this goal without the need for parameters is to involve a validity
index in the clustering process, which can lead to an objective selection of the optimal
number of clusters. In this chapter, we provide two main contributions. First, since
validity indices have mostly been studied on two- or three-dimensional datasets, we
have chosen to evaluate them in two real-world applications: document and word
clustering. Second, we propose a new context-aware method that aims at enhancing the
use of validity indices as stopping criteria in agglomerative algorithms. Experimental
results show that the method is a step forward in using validity indices more reliably
as stopping criteria.
16.1 Introduction
Due to the exponentially growing volume of textual data, clustering methods are
gaining increasing attention in text applications, where they can play an essential
role in bringing more intelligence and efficiency to operations. Textual data
can refer to characters, n-grams, words, chunks, sentences, documents, etc.
In this chapter, we focus on document and word clustering. Both tasks can be
extremely useful in a wide range of applications. On the one hand, document
clustering plays a key role, especially in Information Retrieval (IR), by improving
systems' precision and recall [1], by enabling search without typing through
the scatter/gather method [15], and by enabling easier access to information by
groups [37] or by exploratory browsing [18]. On the other hand, word clustering
finds applications in knowledge acquisition from text [6], query expansion in
IR [26], and word sense disambiguation [34].
A well-known and inherent issue in cluster analysis is the need to keep the
number of input parameters to a minimum [13]. Yet most clustering methods still
require the predefinition of a number of parameters that are usually unknown to
the user, such as the desired number of clusters. In practice, this is an ill-posed
problem, since the final partitions will depend on subjectively chosen parameters
that do not necessarily fit the dataset. This can lead to the discovery of spurious
patterns that do not really exist, or to a failure to discover the true patterns.
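
As a rough illustration of the alternative outlined in the abstract, the following minimal sketch lets a validity index pick the number of clusters instead of asking the user for it. It is not the method proposed in this chapter: the choice of average-link agglomeration, the silhouette width as the validity index, the function names, and the toy 2-D points are all assumptions made for the example.

```python
from itertools import combinations
from math import dist

def average_link(a, b, pts):
    """Average pairwise distance between two clusters (lists of point indices)."""
    return sum(dist(pts[i], pts[j]) for i in a for j in b) / (len(a) * len(b))

def silhouette(clusters, pts):
    """Mean silhouette width of a partition (assumed validity index)."""
    total = 0.0
    for c in clusters:
        if len(c) == 1:
            continue  # a singleton contributes a silhouette of 0
        for i in c:
            # a: mean distance to the other members of the same cluster
            a = sum(dist(pts[i], pts[j]) for j in c if j != i) / (len(c) - 1)
            # b: mean distance to the nearest other cluster
            b = min(average_link([i], o, pts) for o in clusters if o is not c)
            if max(a, b) > 0:
                total += (b - a) / max(a, b)
    return total / len(pts)

def best_partition(pts):
    """Agglomerate bottom-up, score each level from k = n-1 down to k = 2,
    and return the partition with the highest index value."""
    clusters = [[i] for i in range(len(pts))]
    best_score, best = float("-inf"), None
    while len(clusters) > 2:
        # merge the two clusters with the smallest average-link distance
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_link(clusters[p[0]], clusters[p[1]], pts),
        )
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
        score = silhouette(clusters, pts)
        if score > best_score:
            best_score, best = score, [sorted(c) for c in clusters]
    return best, best_score

# Toy data (hypothetical): three well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
partition, score = best_partition(points)
print(len(partition), "clusters selected")
```

The sketch scores every level of the hierarchy and keeps the best one afterwards. Using the index as a true stopping criterion, as discussed in the abstract, would instead require deciding when to halt the merging while the process is still running, which is harder because index values can fluctuate from one level to the next.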
 