Chapter 6
Text Clustering with Mixture of von Mises-Fisher Distributions
Arindam Banerjee, Inderjit Dhillon, Joydeep Ghosh, and Suvrit Sra
6.1 Introduction
6.2 Related Work
6.3 Preliminaries
6.4 EM on a Mixture of vMFs (moVMF)
6.5 Handling High-Dimensional Text Datasets
6.6 Algorithms
6.7 Experimental Results
6.8 Discussion
6.9 Conclusions and Future Work
6.1 Introduction
There is a long-standing folklore in the information retrieval community
that a vector space representation of text data has directional properties, i.e.,
the direction of the vector is much more important than its magnitude. This
belief has led to practices such as using the cosine between two vectors for
measuring similarity between the corresponding text documents, and to the
scaling of vectors to unit L2 norm (41; 40; 20).
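
The two practices mentioned above can be illustrated with a minimal sketch. It assumes documents are represented as rows of term counts; the vectors and the helper names `l2_normalize` and `cosine_similarity` are hypothetical and chosen only for illustration. The sketch shows that cosine similarity ignores vector magnitude, and that after scaling to unit L2 norm the cosine reduces to a plain dot product.

```python
import numpy as np

def l2_normalize(X):
    """Scale each row of X to unit L2 norm; all-zero rows are left as zeros."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return np.divide(X, norms, out=np.zeros_like(X), where=norms > 0)

def cosine_similarity(a, b):
    """Cosine of the angle between two nonzero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical term-count vectors (assumed data, not from the chapter).
doc_short = np.array([1.0, 2.0, 0.0, 1.0])
doc_long = 10.0 * doc_short            # same word proportions, ten times longer
doc_other = np.array([0.0, 1.0, 3.0, 0.0])

# Direction matters, magnitude does not: the long and short documents
# point the same way, so their cosine similarity is 1 up to rounding.
same_direction = cosine_similarity(doc_short, doc_long)

# After unit-norm scaling, the dot product of two rows equals their cosine.
unit = l2_normalize(np.vstack([doc_short, doc_long, doc_other]))
dot_after_scaling = float(unit[0] @ unit[2])
```

Once every document has unit norm, all pairwise cosines can be obtained from a single matrix product such as `unit @ unit.T`, which is one practical reason the normalization is applied up front.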
In this chapter, we describe a probabilistic generative model (44; 25) based
on directional distributions (30) for modeling text data.¹ Specifically, we
suggest that a set of text documents that form multiple topics can be well
modeled by a mixture of von Mises-Fisher (vMF) distributions, with each
component corresponding to a topic. Generative models often provide greater
insight into the anatomy of the data than discriminative approaches. Moreover,
domain knowledge can be easily incorporated into generative models; for
example, in this chapter the directional nature of the data is reflected in our
choice of vMF distributions as the mixture components.
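
For concreteness, the mixture model described in this paragraph can be written out as below. This is a sketch in a commonly used notation for vMF mixtures, and the notation is an assumption here: the symbols $k$, $\alpha_h$, $\mu_h$, $\kappa_h$, and $c_d$ do not appear in the text above.

```latex
f(x \mid \Theta) = \sum_{h=1}^{k} \alpha_h \, f_h(x \mid \mu_h, \kappa_h),
\qquad
f_h(x \mid \mu_h, \kappa_h) = c_d(\kappa_h) \, e^{\kappa_h \mu_h^{\top} x},
\qquad
c_d(\kappa) = \frac{\kappa^{d/2 - 1}}{(2\pi)^{d/2} \, I_{d/2 - 1}(\kappa)}
```

Here $x$ is a unit-norm document vector in $\mathbb{R}^d$, $\mu_h$ is the unit-norm mean direction of topic $h$, $\kappa_h \ge 0$ is its concentration around that direction, the mixing weights $\alpha_h \ge 0$ sum to one, and $I_r$ denotes the modified Bessel function of the first kind of order $r$. Each density depends on $x$ only through $\mu_h^{\top} x$, which for unit vectors is the cosine between the document and the topic direction.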
¹ This chapter treats L2-normalized data and directional data as synonymous.
 
 
 
 
 
 
 
 
 
 