Table 16.1. Summary of datasets used for our study on document clustering
Dataset   # of documents   # of topics
DS1       500              48
DS2       1000             65
16.4.3 Benchmark for Word Clustering
The benchmark used for word clustering consists of two datasets, whose
general characteristics are summarized in Table 16.2. These are two distinct
datasets (no common words) extracted from the SemCor corpus [32]. SemCor is
a collection of 352 texts in which each token is annotated with its POS, its
lemma, and a sense (a synset from WordNet [23]). A polysemous word can be
associated with several senses in the corpus, since it can occur in many
different contexts.
Words are represented according to the Distributional Hypothesis [14]. Thus,
a word w is represented by a vector v in a multidimensional space, where each
dimension corresponds to a local context word, weighted by its PMI (Pointwise
Mutual Information) score with w [5]. A 2-word window is used for context
words. We adopted a language-independent practice in which a context is
represented by a plain word concatenated with its position relative to the
target word (e.g., before-tea, after-coffee). Finally, the semantic similarity
between a pair of words is calculated as the cosine coefficient between their
feature vectors.
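The representation described above can be illustrated with a minimal sketch. It is not the implementation used in the study. The function names are hypothetical, the window is assumed to span two words on each side of the target, and a feature such as before-tea is assumed to mean that "tea" occurs before the target word. Raw PMI (base-2 logarithm) is kept, without any smoothing or clipping of negative values.

```python
import math
from collections import Counter, defaultdict


def positional_contexts(tokens, window=2):
    """Yield (target, feature) pairs, e.g. ('drink', 'after-tea').

    Assumption: 'before-x' means x occurs before the target word,
    and the window covers `window` words on each side.
    """
    for i, target in enumerate(tokens):
        for offset in range(1, window + 1):
            if i - offset >= 0:
                yield target, "before-" + tokens[i - offset]
            if i + offset < len(tokens):
                yield target, "after-" + tokens[i + offset]


def pmi_vectors(sentences, window=2):
    """Map each word to a sparse vector {feature: PMI(word, feature)}."""
    pair_counts = Counter()
    for tokens in sentences:
        pair_counts.update(positional_contexts(tokens, window))

    total = sum(pair_counts.values())
    word_counts = Counter()
    feature_counts = Counter()
    for (word, feature), n in pair_counts.items():
        word_counts[word] += n
        feature_counts[feature] += n

    vectors = defaultdict(dict)
    for (word, feature), n in pair_counts.items():
        # PMI = log2( p(w, f) / (p(w) * p(f)) ), estimated from counts.
        vectors[word][feature] = math.log2(
            n * total / (word_counts[word] * feature_counts[feature])
        )
    return vectors


def cosine(u, v):
    """Cosine coefficient between two sparse vectors stored as dicts."""
    dot = sum(x * v[f] for f, x in u.items() if f in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy usage: two words that share all their positional contexts.
toy = [["i", "drink", "tea"], ["i", "drink", "coffee"]]
vecs = pmi_vectors(toy)
similarity = cosine(vecs["tea"], vecs["coffee"])
```

In this toy corpus "tea" and "coffee" occur in identical positional contexts, so their vectors coincide, whereas "tea" and "i" share no feature because the position prefix keeps left and right contexts apart.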
Table 16.2. Summary of datasets used for our study on word clustering
Dataset   # of words   # of senses
DS3       500          52
DS4       1000         78
16.4.4 Evaluation Methodology
The advantage of using the corpora outlined above (i.e., Reuters, SemCor) is
that their objects have been manually pre-classified by experts. With this at
hand, one can treat these predefined structures as the ideal “Gold Standard”
structures for a clustering algorithm. We must point out, however, that these
structures reflect only a certain level of granularity, which could be too
specific or too generic for the data. Thus, we cannot claim that the
predefined partition for each dataset is the only correct partition, but it is
a correct one that we can reliably consider as a “Gold Standard”.
Subsequently, an external validity index (e.g., FScore) is used to evaluate the
partition provided by an algorithm against the predefined partition. Moreover,
one effective way of evaluating relative indices is to compare their behavior
with that of external indices, which we assume to exhibit the optimal behavior
since they are based on structures set a priori by experts.
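As an illustration of how a partition can be scored against a predefined one, the sketch below computes one common clustering F-score. This section does not define FScore, so the formula is an assumption: each gold class is matched with the cluster that gives it the highest F-measure, and the per-class values are averaged with weights proportional to class size. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def clustering_fscore(gold, predicted):
    """Class-size-weighted best-match F-score of a partition.

    `gold` and `predicted` are equal-length sequences holding, for each
    object, its gold class label and its assigned cluster label.
    Assumption: this is one common definition, not necessarily the one
    used in the study.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("at least one object is required")

    n = len(gold)
    class_sizes = Counter(gold)
    cluster_sizes = Counter(predicted)
    overlap = Counter(zip(gold, predicted))

    # Best F-measure reached by each gold class over all clusters.
    best = {c: 0.0 for c in class_sizes}
    for (c, k), n_ck in overlap.items():
        precision = n_ck / cluster_sizes[k]
        recall = n_ck / class_sizes[c]
        f = 2 * precision * recall / (precision + recall)
        best[c] = max(best[c], f)

    return sum(class_sizes[c] / n * best[c] for c in class_sizes)


# Toy usage: a partition identical to the gold standard, and one that
# merges everything into a single cluster.
gold_labels = ["a", "a", "b", "b"]
perfect = clustering_fscore(gold_labels, [0, 0, 1, 1])
merged = clustering_fscore(gold_labels, [0, 0, 0, 0])
```

Under this definition a partition identical to the gold standard scores 1, and merging all objects into one cluster lowers the score because precision drops for every class.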
 