Exploring Validity Indices for Clustering Textual Data - Mining Complex Data

16

Exploring Validity Indices for Clustering Textual Data

Ahmad El Sayed, Hakim Hacid, and Djamel Zighed

University of Lyon

ERIC Laboratory - 5, avenue Pierre Mendès-France

69676 Bron cedex - France

{asayed, hhacid, dzighed}@eric.univ-lyon2.fr

Abstract. The goal of any clustering algorithm that produces flat partitions of data is to find both the optimal clustering solution and the optimal number of clusters. One natural way to reach this goal without requiring parameters is to involve a validity index in the clustering process, which can lead to an objective selection of the optimal number of clusters. In this chapter, we provide two main contributions. Firstly, since validity indices have mostly been studied on two- or three-dimensional datasets, we have chosen to evaluate them in real-world applications: document and word clustering. Secondly, we propose a new context-aware method that aims at enhancing the use of validity indices as stopping criteria in agglomerative algorithms. Experimental results show that the method is a step forward in using validity indices as stopping criteria with more reliability.

16.1 Introduction

Due to the exponentially growing volume of textual data, clustering methods are gaining increasing attention in text applications, where they can play an essential role in making operations more intelligent and efficient. Textual data can refer to characters, n-grams, words, chunks, sentences, documents, etc. In this chapter, we focus on document and word clustering. Both tasks can be extremely useful in a wide range of applications. On the one hand, document clustering plays a key role, especially in Information Retrieval (IR), by improving systems' precision and recall [1], by enabling search without typing through the scatter/gather method [15], and by enabling easier access to information through groups [37] or exploratory browsing [18]. On the other hand, word clustering finds applications such as knowledge acquisition from text [6], query expansion in IR [26], and word sense disambiguation [34].

A well-known and inherent requirement in cluster analysis is to keep input parameters to a minimum [13]. Yet most clustering methods still require the predefinition of a number of parameters that are usually unknown to the user, such as the desired number of clusters. In practice, this is an ill-posed problem, since the final partitions will depend on subjectively chosen parameters that do not necessarily fit the dataset. This can lead to discovering spurious patterns that do not really exist, or to failing to discover the true patterns.
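As an illustration of how a validity index can stand in for a user-supplied number of clusters, the following minimal Python sketch runs average-linkage agglomerative clustering and scores every level of the hierarchy with the mean silhouette width, keeping the best-scoring partition. The silhouette index, the average-linkage criterion, the function names, and the toy 2-D points (stand-ins for document or word vectors) are all assumptions chosen for illustration; they are not taken from the chapter, and this is not the context-aware method it proposes.

```python
import math


def silhouette(points, clusters):
    """Mean silhouette width of a partition (clusters = lists of point indices)."""
    total = 0.0
    for ci, cluster in enumerate(clusters):
        if len(cluster) == 1:
            continue  # a singleton contributes 0 by convention
        for i in cluster:
            # a: mean distance to the other members of the same cluster
            a = sum(math.dist(points[i], points[j])
                    for j in cluster if j != i) / (len(cluster) - 1)
            # b: mean distance to the nearest other cluster
            b = min(
                sum(math.dist(points[i], points[j]) for j in other) / len(other)
                for cj, other in enumerate(clusters) if cj != ci
            )
            if max(a, b) > 0:
                total += (b - a) / max(a, b)
    return total / len(points)


def average_linkage(points, c1, c2):
    """Mean pairwise distance between two clusters."""
    return sum(math.dist(points[i], points[j])
               for i in c1 for j in c2) / (len(c1) * len(c2))


def best_partition_by_silhouette(points):
    """Merge bottom-up and keep the level with the highest silhouette.

    Evaluates partitions with n-1 down to 2 clusters, so it needs at
    least three points; no target number of clusters is supplied.
    """
    clusters = [[i] for i in range(len(points))]
    best_score, best_partition = -1.0, None
    while len(clusters) > 2:
        # Find the closest pair of clusters under average linkage.
        _, p, q = min(
            (average_linkage(points, clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
        score = silhouette(points, clusters)
        if score > best_score:
            best_score = score
            best_partition = [sorted(c) for c in clusters]
    return best_partition, best_score


# Hypothetical toy data: three well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
partition, score = best_partition_by_silhouette(points)
print(len(partition), "clusters selected:", sorted(partition))
```

On this toy data the three groups are recovered without the number of clusters being given. A stopping-criterion variant would halt the merging once the index starts to degrade; this sketch scans the whole hierarchy for simplicity.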
