6.7.2 Methodology
Performance of the algorithms on all the datasets has been analyzed using
mutual information (MI) between the cluster and class labels. MI quantifies
the amount of statistical similarity between the cluster and class labels (16).
If X is a random variable for the cluster assignments and Y is a random
variable for the pre-existing labels on the same data, then their MI is given
by $I(X;Y) = E\left[\ln \frac{p(X,Y)}{p(X)\,p(Y)}\right]$, where the expectation is computed over the joint
distribution of $(X,Y)$ estimated from a particular clustering of the dataset
under consideration. To facilitate computing MI, for soft-moVMF we "harden"
the clustering by labeling each point with the cluster for which it
has the highest posterior probability (ties broken arbitrarily). Note
that variants of MI have been used to evaluate clustering algorithms by several
researchers. The authors of (32) used a related concept called variation of
information to compare clusterings. An MDL-based formulation that uses
the MI between cluster assignments and class labels was proposed in (22).
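For concreteness, the MI estimate described above can be computed directly from the empirical joint distribution of the (hardened) cluster labels and the class labels. The following Python sketch shows one way to do this; the function names are illustrative and not part of the original implementation.

import numpy as np

def mutual_information(cluster_labels, class_labels):
    # Estimate I(X;Y) in nats from the empirical joint distribution of
    # cluster assignments X and class labels Y.
    n = len(cluster_labels)
    clusters = np.unique(cluster_labels)
    classes = np.unique(class_labels)
    joint = np.zeros((len(clusters), len(classes)))
    for i, x in enumerate(clusters):
        for j, y in enumerate(classes):
            joint[i, j] = np.sum((cluster_labels == x) & (class_labels == y)) / n
    px = joint.sum(axis=1, keepdims=True)  # marginal p(X)
    py = joint.sum(axis=0, keepdims=True)  # marginal p(Y)
    nz = joint > 0                         # skip empty cells: 0 * ln 0 = 0
    return np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz]))

def harden(posteriors):
    # "Harden" a soft clustering: each point gets the cluster with the
    # highest posterior probability (np.argmax takes the first maximum,
    # so ties are broken deterministically rather than arbitrarily).
    return np.argmax(posteriors, axis=1)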
All results reported herein have been averaged over 10 runs. Within each
run, all algorithms were started from the same random initialization to
ensure fairness of comparison; different runs used different random
initializations. No algorithm was restarted within a given run, and all
were allowed to run to completion. Since the standard deviations of MI were reasonably small
for all algorithms, to reduce clutter, we have chosen to omit a display of error
bars in our plots. Also, for practical reasons, the estimate of κ was upper
bounded by a large number ($10^4$, in this case) in order to prevent numeric
overflows. For example, during the iterations, if a cluster has only one point,
the estimate of κ will be infinity (a divide-by-zero error). Upper bounding
the estimate of κ is similar in flavor to ensuring the non-singularity of the
estimated covariance of a multivariate Gaussian in a mixture of Gaussians.
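As a concrete illustration of this safeguard, the sketch below caps the concentration estimate at the $10^4$ bound mentioned above. It assumes the commonly used approximation $\hat{\kappa} \approx (\bar{r}d - \bar{r}^3)/(1 - \bar{r}^2)$, where $\bar{r}$ is the mean resultant length of a cluster's unit vectors; the names are illustrative.

import numpy as np

KAPPA_MAX = 1e4  # the 10^4 upper bound discussed in the text

def estimate_kappa(points):
    # points: (n, d) array of unit-norm vectors assigned to one cluster.
    n, d = points.shape
    r_bar = np.linalg.norm(points.sum(axis=0)) / n  # mean resultant length
    if r_bar >= 1.0:
        # A single-point cluster gives r_bar = 1, so the denominator below
        # vanishes and the estimate of kappa diverges; return the cap.
        return KAPPA_MAX
    kappa = (r_bar * d - r_bar ** 3) / (1.0 - r_bar ** 2)
    return min(kappa, KAPPA_MAX)  # bound the estimate to avoid overflow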
6.7.3 Simulated Datasets
First, to build some intuition and confidence in the working of our vMF-based
algorithms, we exhibit relevant details of soft-moVMF's behavior on the
small-mix dataset shown in Figure 6.4(a).
The clustering produced by our soft cluster assignment algorithm is shown
in Figure 6.4(b). The four points (taken clockwise) marked with solid circles
have cluster labels $(0.15, 0.85)$, $(0.77, 0.23)$, $(0.82, 0.18)$, and $(0.11, 0.89)$,
where a cluster label $(p, 1-p)$ for a point means that the point has
probability $p$ of belonging to Cluster 1 and probability $1-p$ of belonging
to Cluster 2. All other points are categorized as belonging to a single
cluster by ignoring small (less than 0.10) probability values.
The confusion matrix, obtained by "hardening" the clustering produced
by soft-moVMF for the small-mix dataset, is
$\begin{pmatrix} 26 & 1 \\ 0 & 23 \end{pmatrix}$. As is evident from
this confusion matrix, the clustering performed by soft-moVMF is excellent,
with only a single point misclassified.
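A confusion matrix such as this one is obtained by cross-tabulating the hardened cluster labels against the class labels; a minimal sketch follows, with hypothetical names, assuming both label sets are integers in 0..k-1.

import numpy as np

def confusion_matrix(cluster_labels, class_labels, k):
    # cm[i, j] counts the points assigned to cluster i whose class is j.
    cm = np.zeros((k, k), dtype=int)
    for x, y in zip(cluster_labels, class_labels):
        cm[x, y] += 1
    return cm

For small-mix, the single off-diagonal entry corresponds to the one misclassified point.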