We derive two clustering algorithms based on Expectation Maximization
(EM) for estimating the parameters of the mixture model from first princi-
ples. Our algorithms involve estimating a concentration parameter, κ, for
each component of the mixture model. The ability to adapt κ on a per-
component basis leads to substantial performance improvements over existing
generative approaches to modeling directional data. We show a connection
between the proposed methods and a class of existing algorithms for cluster-
ing high-dimensional directional data. In particular, our generative model has
the same relation to spherical kmeans (spkmeans) (20) as a model based on
a mixture of identity covariance Gaussians has to classical kmeans that uses
squared Euclidean distances (9). We also present detailed experimental com-
parisons of the proposed algorithms with spkmeans and one of its variants.
Our formulation uncovers the theoretical justification behind the use of the
cosine similarity measure, which so far has largely been ad hoc, i.e., based
on empirical or intuitive justification.
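As a minimal illustration of the kmeans analogy above (a sketch, not the chapter's derivation): for vectors on the unit sphere, cosine similarity is just an inner product, and squared Euclidean distance is a monotone function of it, since ||x - y||^2 = 2 - 2 cos(x, y) when ||x|| = ||y|| = 1.

```python
import numpy as np

# Hypothetical example data: two random directions, projected onto the unit sphere.
rng = np.random.default_rng(0)
x = rng.standard_normal(5)
y = rng.standard_normal(5)
x /= np.linalg.norm(x)  # normalize to unit L2 norm
y /= np.linalg.norm(y)

cos_sim = x @ y                   # cosine similarity = dot product of unit vectors
sq_dist = np.sum((x - y) ** 2)    # squared Euclidean distance

# On the sphere, the two criteria are equivalent up to an affine transform.
assert np.isclose(sq_dist, 2.0 - 2.0 * cos_sim)
```

This identity is why maximizing cosine similarity on normalized data and minimizing squared Euclidean distance lead to closely related clustering objectives.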
While this chapter focuses on text analysis, we note that many other im-
portant domains such as bioinformatics and collaborative filtering involve di-
rectional data as well. Thus, the scope and applications of the approaches
taken in this chapter are much broader and not limited to text alone.
The remainder of the chapter is organized as follows. In Section 6.2, we dis-
cuss related work on mixture models, text clustering, and vMF distributions.
We review the multivariate vMF distribution in Section 6.3. In Section 6.4
we introduce a generative model using a mixture of vMF distributions. We
then derive the maximum likelihood parameter estimates of this model by
employing an EM framework. Section 6.5 highlights our new method of ap-
proximating κ and also presents a mathematical analysis of hard assignments.
Sections 6.4 and 6.5 form the basis for two clustering algorithms using soft
and hard assignments, respectively, and these algorithms are described in
Section 6.6. Detailed experimental results and comparisons with other algorithms
are offered in Section 6.7. A discussion on the behavior of our algorithms and
a connection with simulated annealing follows in Section 6.8, and we conclude
in Section 6.9.
Notation. Bold faced variables, e.g., x, represent vectors; the norm ||·||
denotes the L2 norm; sets are represented by script-style upper-case letters,
e.g., X, Z. The set of reals is denoted by R, while S^(d-1) denotes the
(d-1)-dimensional sphere embedded in R^d. Probability density functions are
denoted by lower case letters such as f, p, q, and the probability of a set of
events is denoted by P.
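As a small sketch of the notation (illustrative only, with made-up data): points of S^(d-1), the unit sphere embedded in R^d, are obtained by L2-normalizing nonzero vectors.

```python
import numpy as np

# Hypothetical data: three vectors in R^2, one per row.
X = np.array([[3.0, 4.0],
              [1.0, 1.0],
              [0.0, 2.0]])

norms = np.linalg.norm(X, axis=1, keepdims=True)  # the L2 norm ||.|| of each row
S = X / norms                                     # rows now lie on the unit sphere S^1

# Every normalized row has unit L2 norm.
assert np.allclose(np.linalg.norm(S, axis=1), 1.0)
```

In text analysis this normalization is routinely applied to tf-idf vectors, which is what makes the directional (vMF) view of the data natural.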