Text Clustering with Mixture of von Mises-Fisher Distributions - Text Mining: Classification, Clustering, and Applications


• Simulated. We use simulated data to verify that the discrepancy between the computed parameter values and their true values is small. The principal purpose of the simulated data is to validate the "correctness" of our implementations. We used a slight modification of the algorithm given by (53) to generate a set of data points following a given vMF distribution. We describe two synthetic datasets here. The first dataset, small-mix, is 2-dimensional and is used to illustrate soft clustering. The second dataset, big-mix, is a high-dimensional dataset that could serve as a model for real-world text datasets. Let the triple (n, d, k) denote the number of sample points, the dimensionality of a sample point, and the number of clusters, respectively.

1. small-mix: This data has (n, d, k) = (50, 2, 2). The mean direction of each component is a random unit vector. Each component has κ = 4.

2. big-mix: This data has (n, d, k) = (5000, 1000, 4). The mean direction of each component is a random unit vector, and the κ values of the components are 650.98, 266.83, 267.83, and 612.88. The mixing weights of the components are 0.251, 0.238, 0.252, and 0.259.
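The generation of such a mixture can be sketched as follows. This is a minimal sketch, not the authors' implementation: it uses the rejection sampler commonly attributed to Wood (1994), which may or may not coincide with the modified algorithm of (53). The function names `sample_vmf` and `sample_vmf_mixture`, the random seed, and the equal mixing weights for small-mix (the text does not state them) are assumptions made for illustration.

```python
import numpy as np

def sample_vmf(mu, kappa, n, rng):
    """Draw n unit vectors from vMF(mu, kappa) by rejection sampling (kappa > 0)."""
    mu = np.asarray(mu, dtype=float)
    d = mu.shape[0]
    # Constants of the envelope used in the rejection step.
    b = (-2.0 * kappa + np.sqrt(4.0 * kappa**2 + (d - 1) ** 2)) / (d - 1)
    x0 = (1.0 - b) / (1.0 + b)
    c = kappa * x0 + (d - 1) * np.log(1.0 - x0**2)

    # Step 1: sample w, the cosine of the angle to the mean direction.
    w = np.empty(n)
    for i in range(n):
        while True:
            z = rng.beta((d - 1) / 2.0, (d - 1) / 2.0)
            cand = (1.0 - (1.0 + b) * z) / (1.0 - (1.0 - b) * z)
            log_u = np.log(1.0 - rng.uniform())  # 1 - U lies in (0, 1]
            if kappa * cand + (d - 1) * np.log(1.0 - x0 * cand) - c >= log_u:
                w[i] = cand
                break

    # Step 2: attach a uniformly random direction orthogonal to the first axis.
    v = rng.standard_normal((n, d - 1))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    tangent = np.sqrt(np.maximum(0.0, 1.0 - w**2))[:, None] * v
    x = np.hstack([w[:, None], tangent])

    # Step 3: Householder reflection that maps the first axis onto mu.
    e1 = np.zeros(d)
    e1[0] = 1.0
    u = e1 - mu
    norm_u = np.linalg.norm(u)
    if norm_u > 1e-12:
        u /= norm_u
        x = x - 2.0 * np.outer(x @ u, u)
    return x

def sample_vmf_mixture(n, d, kappas, weights, rng):
    """Sample n points from a vMF mixture with random unit mean directions."""
    k = len(kappas)
    mus = rng.standard_normal((k, d))
    mus /= np.linalg.norm(mus, axis=1, keepdims=True)
    labels = rng.choice(k, size=n, p=weights)
    X = np.empty((n, d))
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        X[idx] = sample_vmf(mus[j], kappas[j], idx.size, rng)
    return X, labels, mus

rng = np.random.default_rng(0)  # seed chosen arbitrarily
# small-mix: (n, d, k) = (50, 2, 2), kappa = 4; equal weights are an assumption.
X_small, y_small, mu_small = sample_vmf_mixture(50, 2, [4.0, 4.0], [0.5, 0.5], rng)
# big-mix would use the parameters listed above:
# sample_vmf_mixture(5000, 1000, [650.98, 266.83, 267.83, 612.88],
#                    [0.251, 0.238, 0.252, 0.259], rng)
```

Because the true mean directions, κ values, and labels are returned alongside the samples, estimates produced by a clustering run can be compared against them directly.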

• Classic3. This is a well-known collection of documents. It is an easy dataset to cluster since it contains documents from three well-separated sources. Moreover, the intrinsic clusters are largely balanced.

1. Classic3 is a corpus containing 3893 documents, of which 1400 Cranfield documents are from aeronautical system papers, 1033 Medline documents are from medical journals, and 1460 Cisi documents are from information retrieval papers. The particular vector space model used had a total of 4666 features (words). Thus each document, after normalization, is represented as a unit vector in a 4666-dimensional space.

2. Classic300 is a subset of the Classic3 collection and has 300 documents. From each category of Classic3, we picked 100 documents at random to form this particular dataset. The dimensionality of the data was 5471 (see footnote 2).

3. Classic400 is a subset of Classic3 that has 400 documents. This dataset has 100 randomly chosen documents from each of the Medline and Cisi categories and 200 randomly chosen documents from the Cranfield category. This dataset is specifically designed to create unbalanced clusters in an otherwise easily separable and balanced dataset. The dimensionality of the data was 6205.
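The construction of the Classic300 and Classic400 subsets, together with the unit-length normalization mentioned for Classic3, can be sketched as below. This is only an illustration under stated assumptions: the helper names `stratified_subset` and `l2_normalize_rows` are hypothetical, the label array is a placeholder built from the category sizes given above rather than the real corpus, and the normalization is assumed to be L2. The vocabulary pruning that the source performed with the MC toolkit is not reproduced here.

```python
import numpy as np

def stratified_subset(labels, per_class, rng):
    """Return sorted indices with per_class[c] documents drawn at random from class c."""
    labels = np.asarray(labels)
    picked = []
    for cls, m in per_class.items():
        members = np.flatnonzero(labels == cls)
        picked.append(rng.choice(members, size=m, replace=False))
    return np.sort(np.concatenate(picked))

def l2_normalize_rows(X):
    """Scale each row to unit L2 norm; all-zero rows are left unchanged."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0.0, 1.0, norms)

# Placeholder labels with the Classic3 category sizes quoted in the text.
labels = np.repeat(["cranfield", "medline", "cisi"], [1400, 1033, 1460])
rng = np.random.default_rng(0)  # seed chosen arbitrarily

# Classic300: 100 documents from each category.
classic300_idx = stratified_subset(
    labels, {"cranfield": 100, "medline": 100, "cisi": 100}, rng)
# Classic400: 200 Cranfield documents, 100 each from Medline and Cisi.
classic400_idx = stratified_subset(
    labels, {"cranfield": 200, "medline": 100, "cisi": 100}, rng)
```

With a real document-term matrix, the selected rows would be extracted first, the vocabulary re-pruned on the subset, and `l2_normalize_rows` applied last, so that each document ends up as a unit vector in the subset's own feature space.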

Footnote 2: Note that the dimensionality of Classic300 is larger than that of Classic3. Although the same word-pruning options were used in the MC toolkit, the word distributions are very different, so fewer words were pruned from Classic300 as 'too common' or 'too rare'.
