Simulated. We use simulated data to verify that the discrepancy
between the computed values of the parameters and their true values is
small. Its principal purpose is to validate the “correctness” of our
implementations. We used a slight modification of the algorithm given
by (53) to generate a set of data points following a given vMF
distribution. We describe two synthetic datasets here. The first,
small-mix, is 2-dimensional and is used to illustrate soft-clustering.
The second, big-mix, is a high-dimensional dataset that could serve as
a model for real-world text datasets. Let the triple (n, d, k) denote
the number of sample points, the dimensionality of a sample point, and
the number of clusters, respectively.
1. small-mix: This data has (n, d, k) = (50, 2, 2). The mean
direction of each component is a random unit vector. Each component
has κ = 4.
2. big-mix: This data has (n, d, k) = (5000, 1000, 4). The mean direction
of each component is a random unit vector, and the κ values of the
components are 650.98, 266.83, 267.83, and 612.88. The mixing
weights for the components are 0.251, 0.238, 0.252, and 0.259.
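To make the construction of such synthetic data concrete, the following is a minimal sketch of drawing points from a mixture of vMF distributions. It uses Wood's rejection sampler, a commonly used method, which is an assumption here and not necessarily the modified algorithm of (53). The function names, the random seed, and the equal mixing weights for small-mix (the text does not state them) are likewise illustrative assumptions.

```python
import numpy as np

def random_unit_vectors(k, d, rng):
    """Return k random unit vectors in R^d (normalized Gaussian draws)."""
    m = rng.standard_normal((k, d))
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def sample_vmf(mu, kappa, n, rng):
    """Draw n points from vMF(mu, kappa); assumes ||mu|| = 1, kappa > 0, d >= 2."""
    mu = np.asarray(mu, dtype=float)
    d = mu.shape[0]
    # Constants of Wood's rejection sampler for w = cosine of the angle to mu.
    b = (-2.0 * kappa + np.sqrt(4.0 * kappa**2 + (d - 1) ** 2)) / (d - 1)
    x0 = (1.0 - b) / (1.0 + b)
    c = kappa * x0 + (d - 1) * np.log(1.0 - x0**2)

    w = np.empty(n)
    filled = 0
    while filled < n:
        m = n - filled
        z = rng.beta((d - 1) / 2.0, (d - 1) / 2.0, size=m)
        log_u = np.log1p(-rng.random(m))  # log of a uniform draw on (0, 1]
        cand = (1.0 - (1.0 + b) * z) / (1.0 - (1.0 - b) * z)
        ok = kappa * cand + (d - 1) * np.log(1.0 - x0 * cand) - c >= log_u
        accepted = cand[ok]
        w[filled:filled + accepted.size] = accepted
        filled += accepted.size

    # Uniform random direction in the subspace orthogonal to mu.
    g = rng.standard_normal((n, d))
    g -= np.outer(g @ mu, mu)
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    tangent_scale = np.sqrt(np.maximum(0.0, 1.0 - w**2))
    return w[:, None] * mu + tangent_scale[:, None] * g

def sample_vmf_mixture(mus, kappas, weights, n, rng):
    """Draw n points from a vMF mixture; returns (points, component labels)."""
    mus = np.asarray(mus, dtype=float)
    labels = rng.choice(len(weights), size=n, p=weights)
    x = np.empty((n, mus.shape[1]))
    for j in range(len(weights)):
        idx = np.flatnonzero(labels == j)
        x[idx] = sample_vmf(mus[j], kappas[j], idx.size, rng)
    return x, labels

# A small-mix-like dataset: (n, d, k) = (50, 2, 2), kappa = 4 per component.
# Equal mixing weights are an assumption; the text does not specify them.
rng = np.random.default_rng(0)
mus_small = random_unit_vectors(2, 2, rng)
X_small, z_small = sample_vmf_mixture(mus_small, [4.0, 4.0], [0.5, 0.5], 50, rng)
```

A big-mix-like dataset would follow from the same call with 4 random mean directions in 1000 dimensions, the four κ values listed above, the stated mixing weights, and n = 5000.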
Classic3. This is a well-known collection of documents. It is an easy
dataset to cluster since it contains documents from three well-separated
sources. Moreover, the intrinsic clusters are largely balanced.
1. Classic3 is a corpus containing 3893 documents, among which
1400 Cranfield documents are from aeronautical system papers,
1033 Medline documents are from medical journals, and 1460 Cisi
documents are from information retrieval papers. The particular
vector space model used had a total of 4666 features (words). Thus
each document, after normalization, is represented as a unit vector
in a 4666-dimensional space.
2. Classic300 is a subset of the Classic3 collection and has 300 doc-
uments. From each category of Classic3, we picked 100 documents
at random to form this particular dataset. The dimensionality of
the data was 5471.²
3. Classic400 is a subset of Classic3 that has 400 documents. This
dataset has 100 randomly chosen documents from the Medline
and Cisi categories and 200 randomly chosen documents from the
Cranfield category. This dataset is specifically designed to create
unbalanced clusters in an otherwise easily separable and balanced
dataset. The dimensionality of the data was 6205.
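The way the two subsets are drawn can be sketched as per-category random sampling followed by unit-length normalization of each document vector. This is only an illustration under stated assumptions: the label coding (0 = Cranfield, 1 = Medline, 2 = Cisi), the function names, and the seed are hypothetical, and the word-pruning step performed by the MC toolkit is not reproduced.

```python
import numpy as np

def stratified_subset(labels, per_class, rng):
    """Pick per_class[c] document indices at random, without replacement, from each class c."""
    picks = [
        rng.choice(np.flatnonzero(labels == c), size=m, replace=False)
        for c, m in per_class.items()
    ]
    return np.sort(np.concatenate(picks))

def l2_normalize(X):
    """Scale each row of X to unit Euclidean length (all-zero rows are left unchanged)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms)

# Category sizes as given for Classic3; the integer coding is an assumption.
labels = np.repeat([0, 1, 2], [1400, 1033, 1460])
rng = np.random.default_rng(0)

# Classic300-style subset: 100 documents per category.
idx300 = stratified_subset(labels, {0: 100, 1: 100, 2: 100}, rng)
# Classic400-style subset: 200 Cranfield, 100 Medline, 100 Cisi.
idx400 = stratified_subset(labels, {0: 200, 1: 100, 2: 100}, rng)
```

Given a document-term matrix `X` for the full corpus (not constructed here), `l2_normalize(X[idx300])` would yield the unit vectors for the subset, before any vocabulary pruning.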
² Note that the dimensionality of Classic300 is larger than that of Classic3. Although the
same options were used in the MC toolkit for word pruning, the word distributions differ
considerably, so fewer words were pruned from Classic300 in the 'too common' or 'too rare'
categories.