one can consider a hybrid algorithm that employs soft-moVMF for the first
few (more important) iterations, and then switches to hard-moVMF for speed,
and measure the speed-quality tradeoff that this hybrid approach provides.
Another possible extension would be to consider an online version of the EM-
based algorithms as discussed in this paper, developed along the lines of (34).
Online algorithms are particularly attractive for dealing with streaming data
when memory is limited, and for modeling mildly non-stationary data sources.
We could also adapt a local search strategy, such as the one in (18), to
incremental EM to yield better local optima for both hard and soft assignments.
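
The hybrid schedule mentioned above can be sketched in a few lines. This is a minimal illustration under strong simplifying assumptions that are not part of the algorithms in this chapter: all clusters share one fixed concentration kappa and equal priors, so the vMF normalizing constants cancel and the posterior reduces to a softmax over kappa times the cosine similarity. The names hybrid_movmf, soft_iters, and total_iters are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def responsibilities(x, means, kappa, hard):
    """Cluster weights for one unit vector x.

    Assumes a shared kappa and equal priors, so only kappa * mu^T x matters.
    """
    scores = [kappa * dot(x, m) for m in means]
    if hard:
        # Hard assignment: all weight on the most similar mean direction.
        best = max(range(len(means)), key=lambda h: scores[h])
        return [1.0 if h == best else 0.0 for h in range(len(means))]
    # Soft assignment: numerically stable softmax over the scores.
    top = max(scores)
    w = [math.exp(s - top) for s in scores]
    total = sum(w)
    return [v / total for v in w]

def hybrid_movmf(data, init_means, kappa=10.0, soft_iters=3, total_iters=10):
    """Run soft assignments for soft_iters iterations, then hard ones."""
    means = [normalize(m) for m in init_means]
    dim = len(data[0])
    for it in range(total_iters):
        hard = it >= soft_iters
        resp = [responsibilities(x, means, kappa, hard) for x in data]
        new_means = []
        for h in range(len(means)):
            # Weighted sum of the points, renormalized onto the unit sphere.
            s = [sum(r[h] * x[j] for r, x in zip(resp, data)) for j in range(dim)]
            if sum(c * c for c in s) > 0.0:
                new_means.append(normalize(s))
            else:
                new_means.append(means[h])  # empty cluster: keep the old mean
        means = new_means
    labels = [max(range(len(means)), key=lambda h: dot(x, means[h])) for x in data]
    return means, labels
```

Varying soft_iters between 0 (purely hard) and total_iters (purely soft) is one way to measure the speed-quality tradeoff; a faithful version would also re-estimate the priors and a per-cluster kappa in every iteration.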
The vMF distribution that we considered in the proposed techniques is one
of the simplest parametric distributions for directional data. The iso-density
lines of the vMF distribution are circles on the hypersphere, i.e., all points on
the surface of the hypersphere at a constant angle from the mean direction.
In some applications, more general iso-density contours may be desirable.
There are more general models on the unit sphere, such as the Bingham dis-
tribution, the Kent distribution, the Watson distribution (already discussed
in the previous section), the Fisher-Bingham distribution, the Pearson type
VII distributions (42; 30), etc., that can potentially be more applicable in the
general setting. For example, the Fisher-Bingham distributions have added
modeling power since there are O(d^2) parameters for each distribution.
However, the parameter estimation problem, especially in high dimensions, can
be significantly more difficult for such models, as more parameters need to
be estimated from the data. One definitely needs substantially more data to
get reliable estimates of the parameters. Further, for some cases, e.g., the
Kent distribution, it can be difficult to solve the estimation problem in more
than three dimensions (36). Hence these more complex models may not be
viable for many high-dimensional problems. Nevertheless, the tradeoff between
model complexity (in terms of the number of parameters and their estimation)
and sample complexity needs to be studied in more detail in the context of
directional data.
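
To make the gap in model complexity concrete, the following sketch counts free parameters per mixture component. It assumes the common parameterization of the Fisher-Bingham density as proportional to exp(kappa * mu^T x + x^T A x) with A symmetric, where the trace of A is not identifiable on the unit sphere; the function names are hypothetical and the counts are a rough guide, not a statement about any particular estimation procedure.

```python
def vmf_free_parameters(d):
    """vMF in d dimensions: unit mean direction (d - 1) plus concentration (1)."""
    return (d - 1) + 1

def fisher_bingham_free_parameters(d):
    """Fisher-Bingham under the assumed parameterization.

    kappa * mu contributes d parameters; the symmetric matrix A contributes
    d * (d + 1) / 2 entries, minus 1 because adding a multiple of the
    identity to A only changes the normalizing constant when x^T x = 1.
    """
    return d + (d * (d + 1) // 2 - 1)

if __name__ == "__main__":
    for d in (3, 100, 1000):
        print(d, vmf_free_parameters(d), fisher_bingham_free_parameters(d))
```

For d = 3 this gives 3 parameters for the vMF against 8 for the Fisher-Bingham; the second count grows quadratically with d while the first grows linearly, which is why far more data is needed for reliable estimates in high dimensions.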
Acknowledgments
The authors would like to thank Sugato Basu and Jiye Yu for experiments
with the Slashdot datasets. This research was supported in part by the Digital
Technology Center Data Mining Consortium (DDMC) at the University of
Minnesota, Twin Cities.