hard-movMF, even though the final probability values obtained by soft-movMF
are actually very close to 0 and 1; and (ii) why is soft-movMF, which needs
to estimate more parameters, doing better even when there is an insufficient
number of points relative to the dimensionality of the space.
It turns out that both these issues can be understood by taking a closer look
at how soft-movMF converges. In all our experiments, we initialized κ to 10,
and the initial centroids to small random perturbations of the global centroid.
Hence, for soft-movMF, the initial posterior membership distributions of the
data points are almost uniform and the Shannon entropy of the hidden random
variables is very high. The change of this entropy over iterations for the
News20 subsets is presented in Figure 6.8. The behavior is similar for all the
other datasets that we studied. Unlike kmeans-based algorithms where most
of the relocation happens in the first two or three iterations with only minor
adjustments later on, in soft-movMF the data points are non-committal in
the first few iterations, and the entropy remains very high (the maximum
possible entropy for 3 clusters is log₂ 3 ≈ 1.585). The cluster patterns
are discovered only after several iterations, and the entropy drops drastically
within a small number of iterations after that. When the algorithm converges,
the entropy is practically zero and all points are effectively hard-assigned
to their respective clusters. Note that this behavior is strikingly similar to
(locally adaptive) annealing approaches where κ can be considered as the
inverse of the temperature parameter. The drastic drop in entropy after a few
iterations is the typical critical temperature behavior observed in annealing.
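To make the annealing analogy concrete, the following minimal sketch (in Python; the function names soft_posteriors and mean_entropy_bits are our own, and it assumes equal mixing priors and a single shared κ so that the vMF normalizing constants cancel in the E-step) shows that the posterior memberships are then just a softmax over κ times the cosine similarities, with κ playing the role of an inverse temperature:

```python
import numpy as np

def soft_posteriors(X, mus, kappa):
    """E-step memberships under equal priors and one shared kappa:
    a softmax over kappa * cosine similarity, so kappa acts as an
    inverse temperature. X: (n, d) unit-norm points; mus: (k, d)
    unit-norm mean directions."""
    logits = kappa * (X @ mus.T)                  # (n, k)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def mean_entropy_bits(p):
    """Mean Shannon entropy (in bits) of the membership rows."""
    q = np.clip(p, 1e-12, 1.0)                    # guard against log(0)
    return float(np.mean(-np.sum(q * np.log2(q), axis=1)))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # unit-norm data points
mus = rng.normal(size=(3, 50))
mus /= np.linalg.norm(mus, axis=1, keepdims=True) # 3 mean directions

for kappa in (1.0, 10.0, 100.0, 1000.0):          # temperature 1/kappa falls
    print(kappa, mean_entropy_bits(soft_posteriors(X, mus, kappa)))
```

At small κ the memberships are nearly uniform (mean entropy close to log₂ 3 ≈ 1.585), and at large κ they are effectively hard (entropy near zero), mirroring the entropy drop described above.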
As text data has only non-negative feature values, all the data points lie in
the first orthant of the d -dimensional hypersphere and, hence, are naturally
very concentrated. Thus, the final κ values on convergence are very high,
reflecting the concentration in the data, and implying a low final tempera-
ture from the annealing perspective. Initializing κ to a low value, or
equivalently a high temperature, is therefore a good idea: as soft-movMF
runs, the κ values keep increasing over successive iterations toward their
final large values, giving the effect of a decreasing temperature in the
process without any explicit deterministic annealing strategy. Also,
different mixture components can take different values of κ, as
automatically determined by the EM updates, and need not be controlled
by any external heuristic. The cost of the added flexibility in soft-movMF
over spkmeans is the extra computation in estimating the κ values. Thus the
soft-movMF algorithm provides a trade-off between modeling power and com-
putational demands, but one that, judging from the empirical results, seems
quite worthwhile. The hard-movMF algorithm, in spite of using the more general
vMF model, suffers because of its hard assignments from the very beginning.
The fskmeans and spkmeans algorithms do not do well on difficult datasets due
to their hard assignment schemes as well as their significantly weaker modeling
capabilities.
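As a sketch of how those κ values are re-estimated inside EM, the M-step for each component can use a widely used closed-form approximation based on the weighted mean resultant length (the function name estimate_kappa is our own):

```python
import numpy as np

def estimate_kappa(X, weights):
    """Approximate concentration update for one vMF component.
    X: (n, d) unit-norm data points; weights: (n,) posterior
    memberships of the points in this component. Uses the
    closed-form approximation
        kappa ~ (r_bar * d - r_bar**3) / (1 - r_bar**2),
    where r_bar is the length of the weighted mean resultant
    vector (r_bar -> 1 as the points concentrate)."""
    d = X.shape[1]
    r_bar = np.linalg.norm(weights @ X) / weights.sum()
    return (r_bar * d - r_bar ** 3) / (1.0 - r_bar ** 2)
```

Since r̄ approaches 1 as a component's points concentrate around its mean direction, the estimated κ grows over successive iterations, which is exactly the decreasing-temperature effect discussed above.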
Finally, a word on model selection, since choosing the number of clusters
remains one of the most widely debated topics in clustering (31). A new objec-