Text Clustering with Mixture of von Mises-Fisher Distributions - Text Mining: Classification, Clustering, and Applications


one can consider a hybrid algorithm that employs soft-moVMF for the first

few (more important) iterations, and then switches to hard-moVMF for speed,

and measure the speed-quality tradeoff that this hybrid approach provides.

Another possible extension would be to consider an online version of the

EM-based algorithms discussed in this paper, developed along the lines of (34).

Online algorithms are particularly attractive for dealing with streaming data

when memory is limited, and for modeling mildly non-stationary data sources.

We could also adapt a local search strategy, such as the one in (18), for

incremental EM to yield better local minima for both hard and soft assignments.
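
The hybrid scheme mentioned above (soft assignments for the first few iterations, hard assignments afterwards) could be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the algorithm as specified in the chapter: a single fixed concentration `kappa` is shared by all clusters, so the vMF normalizing constant cancels in the posterior and no Bessel functions are needed. The names `hybrid_movmf`, `normalize_rows`, `soft_iters`, and `hard_iters` are hypothetical.

```python
import numpy as np

def normalize_rows(a):
    """Scale each row to unit L2 norm."""
    return a / np.linalg.norm(a, axis=1, keepdims=True)

def hybrid_movmf(X, k, soft_iters=5, hard_iters=20, kappa=50.0, seed=0):
    """Cluster the unit-norm rows of X with soft E-steps first, then hard ones.

    Assumption: one fixed kappa shared by all clusters, so the posterior of
    cluster h for point x is proportional to alpha_h * exp(kappa * mu_h^T x).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    mu = X[rng.choice(n, size=k, replace=False)].copy()  # initial mean directions
    alpha = np.full(k, 1.0 / k)                          # mixing weights

    for it in range(soft_iters + hard_iters):
        # Log-posterior up to an additive constant per point.
        scores = np.log(alpha) + kappa * (X @ mu.T)
        if it < soft_iters:
            # Soft E-step: normalized responsibilities (stable softmax).
            scores -= scores.max(axis=1, keepdims=True)
            resp = np.exp(scores)
            resp /= resp.sum(axis=1, keepdims=True)
        else:
            # Hard E-step: each point goes entirely to its best cluster.
            resp = np.zeros_like(scores)
            resp[np.arange(n), scores.argmax(axis=1)] = 1.0

        # M-step: mean direction is the normalized weighted sum of points.
        weights = resp.sum(axis=0)
        sums = resp.T @ X
        norms = np.linalg.norm(sums, axis=1)
        keep = norms > 1e-12            # empty clusters keep their old mean
        mu[keep] = sums[keep] / norms[keep, None]
        floored = np.maximum(weights, 1e-12)
        alpha = floored / floored.sum()

    return resp.argmax(axis=1), mu
```

Calling `hybrid_movmf(X, k=3)` on a matrix `X` of unit-normalized document vectors returns cluster labels and mean directions. Varying `soft_iters` against `hard_iters` is one way to probe the speed-quality tradeoff. A fuller version would also re-estimate a per-cluster concentration in the M-step.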

The vMF distribution that we considered in the proposed techniques is one

of the simplest parametric distributions for directional data. The iso-density

lines of the vMF distribution are circles on the hypersphere, i.e., all points on

the surface of the hypersphere at a constant angle from the mean direction.
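
The iso-density property follows because the vMF density depends on a point x only through the inner product mu^T x, i.e., through the cosine of its angle to the mean direction. A small numerical check, offered as a sketch with hypothetical helper names (`vmf_log_density_unnormalized`, `points_at_angle`) and with the normalizing constant omitted since it does not depend on x:

```python
import numpy as np

def vmf_log_density_unnormalized(x, mu, kappa):
    """Log vMF density minus its log-normalizer: kappa * mu^T x."""
    return kappa * (x @ mu)

def points_at_angle(mu, theta, n, rng):
    """Return n unit vectors whose angle to the unit vector mu is theta."""
    d = mu.shape[0]
    v = rng.standard_normal((n, d))
    v -= np.outer(v @ mu, mu)                      # drop the component along mu
    v /= np.linalg.norm(v, axis=1, keepdims=True)  # unit vectors orthogonal to mu
    return np.cos(theta) * mu + np.sin(theta) * v

rng = np.random.default_rng(0)
mu = np.zeros(10)
mu[0] = 1.0
x = points_at_angle(mu, theta=0.7, n=5, rng=rng)
vals = vmf_log_density_unnormalized(x, mu, kappa=20.0)
# Every entry of vals equals kappa * cos(theta), regardless of direction.
```
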

In some applications, more general iso-density contours may be desirable.

There are more general models on the unit sphere, such as the Bingham

distribution, the Kent distribution, the Watson distribution (already discussed

in the previous section), the Fisher-Bingham distribution, the Pearson type

VII distributions (42; 30), etc., that can potentially be more applicable in the

general setting. For example, the Fisher-Bingham distributions have added

modeling power since there are O(d^2) parameters for each distribution. However,

the parameter estimation problem, especially in high dimensions, can

be significantly more difficult for such models, as more parameters need to

be estimated from the data. One definitely needs substantially more data to

get reliable estimates of the parameters. Further, for some cases, e.g., the

Kent distribution, it can be difficult to solve the estimation problem in more

than 3 dimensions (36). Hence these more complex models may not be viable

for many high-dimensional problems. Nevertheless, the tradeoff between

model complexity (in terms of the number of parameters and their estimation)

and sample complexity needs to be studied in more detail in the context of

directional data.
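
To make the gap in model complexity concrete, the following sketch counts free parameters per component. It assumes the usual parameterizations, which are not spelled out in this section: a vMF with a unit mean direction and one concentration, and a Fisher-Bingham density proportional to exp(kappa * mu^T x + x^T A x) with A symmetric. The function names are hypothetical.

```python
def vmf_param_count(d):
    """Free parameters of a vMF in R^d: unit mean direction (d - 1) plus kappa (1)."""
    return (d - 1) + 1

def fisher_bingham_param_count(d):
    """Free parameters of a Fisher-Bingham distribution in R^d.

    Assumed form: density proportional to exp(kappa * mu^T x + x^T A x).
    Counts the unit vector mu (d - 1), kappa (1), and the symmetric matrix A
    (d * (d + 1) / 2), minus 1 because adding a multiple of the identity to A
    does not change the density on the unit sphere.
    """
    return (d - 1) + 1 + d * (d + 1) // 2 - 1

for d in (3, 100, 1000):
    print(d, vmf_param_count(d), fisher_bingham_param_count(d))
```

For d = 3 this gives 3 versus 8 parameters. For d = 1000, a modest dimensionality for text, it gives 1000 versus 501499 per mixture component, which illustrates why far more data would be needed for reliable estimates.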

Acknowledgments

The authors would like to thank Sugato Basu and Jiye Yu for experiments

with the Slashdot datasets. This research was supported in part by the Digital

Technology Center Data Mining Consortium (DDMC) at the University of

Minnesota, Twin Cities.
