Exploring Validity Indices for Clustering Textual Data - Mining Complex Data - page 274


Table 16.1. Summary of datasets used for our study on document clustering

Dataset    # of documents    # of topics
DS1        500               48
DS2        1000              65

16.4.3 Benchmark for Word Clustering

The benchmark used for word clustering consists of two datasets, whose general characteristics are summarized in Table 16.2. These are two distinct datasets (no common words) extracted from the SemCor corpus [32]. SemCor is a collection of 352 texts where each token is annotated with its POS, lemma, and sense (a synset from WordNet [23]). A polysemous word can be associated with several senses in the corpus, because it can occur in many different contexts. Words are represented according to the Distributional Hypothesis [14]. Thus, a word w is represented by a vector v in a multidimensional space, where each dimension corresponds to a local context word, weighted by its PMI (Pointwise Mutual Information) score with w [5]. A 2-word window size is considered for context words. We adopted a language-independent practice, representing a context by a plain word concatenated with its position relative to the target word (e.g., before-tea, after-coffee). Finally, the semantic similarity between a pair of words is calculated by means of the cosine coefficient between their feature vectors.
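The representation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`positional_contexts`, `pmi_vectors`, `cosine`) are hypothetical, a feature such as `before-tea` is assumed to mean that "tea" occurs before the target word, only the side (not the distance) is encoded, and PMI is estimated from raw co-occurrence counts with a base-2 logarithm.

```python
import math
from collections import Counter


def positional_contexts(tokens, window=2):
    """Yield (target, feature) pairs for every context word within the window.

    Assumption: a feature is the context word prefixed by the side on which
    it occurs relative to the target ("before-" or "after-").
    """
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                side = "before" if j < i else "after"
                yield target, f"{side}-{tokens[j]}"


def pmi_vectors(tokens, window=2):
    """Build one sparse PMI-weighted feature vector (a dict) per target word."""
    pair_counts = Counter(positional_contexts(tokens, window))
    total = sum(pair_counts.values())
    word_counts = Counter()
    feature_counts = Counter()
    for (word, feature), n in pair_counts.items():
        word_counts[word] += n
        feature_counts[feature] += n

    vectors = {}
    for (word, feature), n in pair_counts.items():
        # PMI(w, f) = log2( P(w, f) / (P(w) * P(f)) ), estimated from counts.
        pmi = math.log2(n * total / (word_counts[word] * feature_counts[feature]))
        vectors.setdefault(word, {})[feature] = pmi
    return vectors


def cosine(u, v):
    """Cosine coefficient between two sparse vectors stored as dicts."""
    dot = sum(x * v[f] for f, x in u.items() if f in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    tokens = "i drink hot tea i drink hot coffee".split()
    vectors = pmi_vectors(tokens, window=2)
    print(cosine(vectors["tea"], vectors["coffee"]))
```

On a corpus of realistic size, rare co-occurrences tend to receive inflated PMI scores, so a frequency threshold or a positive-PMI variant is often applied; whether the study did so is not stated here.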

Table 16.2. Summary of datasets used for our study on word clustering

Dataset    # of words    # of senses
DS3        500           52
DS4        1000          78

16.4.4 Evaluation Methodology

The advantage of using the outlined corpora (i.e., Reuters, SemCor) is that their objects have been manually pre-classified by experts. Given these manual classifications, one can treat them as the ideal “Gold Standard” structures for a clustering algorithm. However, we must point out that these structures reflect only a certain level of granularity, which could be too specific or too generic for the data. Thus, we cannot claim that the predefined partition for each dataset is the only correct partition, but it is a correct one that we can reliably use as a “Gold Standard”.

Subsequently, an external validity index (e.g., FScore) is used to evaluate the partition provided by an algorithm against the predefined partition. Moreover, one effective way of evaluating relative indices is to compare their behavior with that of external indices, which we assume exhibit optimal behavior since they are based on structures set a priori by experts.
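To make the external evaluation step concrete, the sketch below scores a predicted partition against a gold-standard partition. It assumes the class-weighted FScore commonly used in the document clustering literature (for each gold class, take the best F-measure over all clusters, then weight by class size); the exact formula used in this chapter is not shown on this page, and the function name `fscore` is hypothetical.

```python
from collections import Counter


def fscore(gold, predicted):
    """Class-weighted FScore of a predicted partition against a gold partition.

    Assumption: FScore = sum over gold classes c of (n_c / n) * max_k F(c, k),
    where F(c, k) is the harmonic mean of the precision and recall of
    cluster k with respect to class c.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must label the same objects")
    n = len(gold)
    class_sizes = Counter(gold)
    cluster_sizes = Counter(predicted)
    overlap = Counter(zip(gold, predicted))

    total = 0.0
    for c, n_c in class_sizes.items():
        best = 0.0
        for k, n_k in cluster_sizes.items():
            n_ck = overlap[(c, k)]
            if n_ck == 0:
                continue
            precision = n_ck / n_k
            recall = n_ck / n_c
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_c / n) * best
    return total


if __name__ == "__main__":
    gold = ["sports", "sports", "finance", "finance"]
    print(fscore(gold, [0, 0, 1, 1]))  # partition identical to the gold standard
    print(fscore(gold, [0, 0, 0, 0]))  # everything merged into one cluster
```

A relative index can then be assessed by checking whether it ranks a set of candidate partitions in the same order as this external score does.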
