We derive two clustering algorithms based on Expectation Maximization
(EM) for estimating the parameters of the mixture model from first princi-
ples. Our algorithms involve estimating a concentration parameter, κ, for
each component of the mixture model. The ability to adapt κ on a per-
component basis leads to substantial performance improvements over existing
generative approaches to modeling directional data. We show a connection
between the proposed methods and a class of existing algorithms for cluster-
ing high-dimensional directional data. In particular, our generative model has
the same relation to spherical kmeans (spkmeans) (20) as a model based on
a mixture of identity covariance Gaussians has to classical kmeans that uses
squared Euclidean distances (9). We also present detailed experimental com-
parisons of the proposed algorithms with spkmeans and one of its variants.
Our formulation uncovers the theoretical justification behind the use of the
cosine similarity measure, which so far has largely been ad hoc, i.e., based
on empirical or intuitive justification.
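As a minimal illustration of the kmeans analogy above (a sketch, not the chapter's derivation): for vectors on the unit sphere, cosine similarity is just an inner product, and squared Euclidean distance is a monotone function of it, since ||x - y||^2 = 2 - 2 cos(x, y) when ||x|| = ||y|| = 1.

```python
import numpy as np

# Hypothetical example data: two random directions, projected onto the unit sphere.
rng = np.random.default_rng(0)
x = rng.standard_normal(5)
y = rng.standard_normal(5)
x /= np.linalg.norm(x)  # normalize to unit L2 norm
y /= np.linalg.norm(y)

cos_sim = x @ y                   # cosine similarity = dot product of unit vectors
sq_dist = np.sum((x - y) ** 2)    # squared Euclidean distance

# On the sphere, the two criteria are equivalent up to an affine transform.
assert np.isclose(sq_dist, 2.0 - 2.0 * cos_sim)
```

This identity is why maximizing cosine similarity on normalized data and minimizing squared Euclidean distance lead to closely related clustering objectives.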
While this chapter focuses on text analysis, we note that many other im-
portant domains such as bioinformatics and collaborative filtering involve di-
rectional data as well. Thus, the scope and applications of the approaches
taken in this chapter are much broader and not limited to text alone.
The remainder of the chapter is organized as follows. In Section 6.2, we dis-
cuss related work on mixture models, text clustering, and vMF distributions.
We review the multivariate vMF distribution in Section 6.3. In Section 6.4
we introduce a generative model using a mixture of vMF distributions. We
then derive the maximum likelihood parameter estimates of this model by
employing an EM framework. Section 6.5 highlights our new method of ap-
proximating κ and also presents a mathematical analysis of hard assignments.
Sections 6.4 and 6.5 form the basis for two clustering algorithms using soft
and hard assignments, respectively, and these algorithms are described in
Section 6.6. Detailed experimental results and comparisons with other algorithms
are offered in Section 6.7. A discussion on the behavior of our algorithms and
a connection with simulated annealing follows in Section 6.8, and we conclude
in Section 6.9.
Notation. Bold faced variables, e.g., x, represent vectors; the norm ||·||
denotes the L2 norm; sets are represented by script-style upper-case letters,
e.g., X, Z. The set of reals is denoted by R, while S^(d-1) denotes the
(d-1)-dimensional sphere embedded in R^d. Probability density functions are
denoted by lower case letters such as f, p, q, and the probability of a set of
events is denoted by P.
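As a small sketch of the notation (illustrative only, with made-up data): points of S^(d-1), the unit sphere embedded in R^d, are obtained by L2-normalizing nonzero vectors.

```python
import numpy as np

# Hypothetical data: three vectors in R^2, one per row.
X = np.array([[3.0, 4.0],
              [1.0, 1.0],
              [0.0, 2.0]])

norms = np.linalg.norm(X, axis=1, keepdims=True)  # the L2 norm ||.|| of each row
S = X / norms                                     # rows now lie on the unit sphere S^1

# Every normalized row has unit L2 norm.
assert np.allclose(np.linalg.norm(S, axis=1), 1.0)
```

In text analysis this normalization is routinely applied to tf-idf vectors, which is what makes the directional (vMF) view of the data natural.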