Text Clustering with Mixture of von Mises-Fisher Distributions - Text Mining: Classification, Clustering, and Applications


Algorithm 7 spkmeans

Require: Set X of data points on S^{d-1}
Ensure: A disjoint k-partitioning {X_h}_{h=1}^{k} of X
Initialize μ_h, h = 1, ..., k
repeat
    {The E (Expectation) step of EM}
    Set X_h ← ∅, h = 1, ..., k
    for i = 1 to n do
        X_h ← X_h ∪ {x_i} where h = argmax_{h'} x_i^T μ_{h'}
    {The M (Maximization) step of EM}
    for h = 1 to k do
        μ_h ← (Σ_{x ∈ X_h} x) / ‖Σ_{x ∈ X_h} x‖
until convergence.
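The spkmeans loop above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`normalize`, `dot`, `spkmeans`), the initialization from randomly sampled data points, the "assignments unchanged" convergence test, and the toy data are all choices made here for the example.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v] if norm > 0 else list(v)


def dot(a, b):
    """Inner product; for unit vectors this is the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


def spkmeans(points, k, max_iter=100, seed=0):
    """Spherical k-means on unit vectors; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Assumption: initialize the mean directions with k distinct data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # E step: assign each point to the mean direction with the largest x^T mu.
        new_labels = [max(range(k), key=lambda h: dot(x, centroids[h]))
                      for x in points]
        if new_labels == labels:
            break  # assumption: unchanged assignments count as convergence
        labels = new_labels
        # M step: each mean direction becomes the normalized sum of its members.
        for h in range(k):
            members = [x for x, lab in zip(points, labels) if lab == h]
            if members:  # keep the previous direction if a cluster is empty
                total = [sum(col) for col in zip(*members)]
                centroids[h] = normalize(total)
    return labels, centroids


# Toy data: two well-separated groups of directions in the plane.
data = [normalize(v) for v in [(1.0, 0.1), (1.0, -0.1), (0.9, 0.0),
                               (0.1, 1.0), (-0.1, 1.0), (0.0, 0.9)]]
labels, centroids = spkmeans(data, k=2)
```

Because every point and every mean direction has unit norm, maximizing the inner product in the E step is the same as maximizing cosine similarity, which is why no explicit distance computation appears.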

4. moVMF-based clustering using soft assignments: soft-moVMF.

It has already been established that kmeans using Euclidean distance performs much worse than spkmeans for text data (49), so we do not consider it here. Generative model based algorithms that use mixtures of Bernoulli or multinomial distributions, which have been shown to perform well for text datasets, have also not been included in the experiments. They are excluded because a recent empirical study over 15 text datasets showed that simple versions of vMF mixture models (with κ constant for all clusters) outperform the multinomial model on all but one dataset (Classic3), and that the Bernoulli model was inferior on all datasets (56). Further, for certain datasets, we compare clustering performance with latent Dirichlet allocation (LDA) (12) and exponential family approximation of Dirichlet compounded multinomial (EDCM) models (23).

6.7.1 Datasets

The datasets that we used for empirical validation and comparison of our algorithms were carefully selected to represent some typical clustering problems. We also created various subsets of some of the datasets to gain greater insight into the nature of the clusters discovered or to model a particular clustering scenario (e.g., balanced clusters, skewed clusters, overlapping clusters). We drew our data from five sources: Simulated, Classic3, Yahoo News, 20 Newsgroups, and Slashdot. For all the text document datasets, the toolkit MC (17) was used to create the high-dimensional vector space model that each of the four algorithms used. Matlab code was used to render the input as a vector space for the simulated datasets.
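To make the idea of a vector space model on the unit sphere concrete, the following sketch maps a few documents to unit-norm term-frequency vectors. It is only an illustration: it does not reproduce the processing performed by the MC toolkit, and the function name `unit_tf_vectors`, the whitespace tokenization, the raw term-frequency weighting, and the sample documents are assumptions made for this example.

```python
import math
from collections import Counter


def unit_tf_vectors(docs):
    """Map whitespace-tokenized documents to unit-norm term-frequency vectors."""
    # Assumption: the vocabulary is every lowercased token seen in the corpus.
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    vectors = []
    for d in docs:
        counts = Counter(d.lower().split())
        v = [0.0] * len(vocab)
        for w, c in counts.items():
            v[index[w]] = float(c)
        # Project onto the unit sphere so that inner products are cosines.
        norm = math.sqrt(sum(x * x for x in v))
        vectors.append([x / norm for x in v] if norm > 0 else v)
    return vocab, vectors


# Hypothetical three-document corpus.
docs = ["the cat sat", "the dog sat", "stocks fell sharply"]
vocab, vectors = unit_tf_vectors(docs)
```

Documents that share terms end up with a positive inner product, while documents with disjoint vocabularies are orthogonal, which is the geometry that algorithms operating on S^{d-1} rely on.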
