Text Clustering with Mixture of von Mises-Fisher Distributions - Text Mining: Classification, Clustering, and Applications - page 153


[Plot: "Entropy over Iterations for soft-movMF". x-axis: number of iterations (0 to 25); y-axis: entropy (0 to 1.8). Series: news20-same3, small-news20-same3, news20-diff3, small-news20-diff3.]

FIGURE 6.8 (see color insert following page 130): Variation of entropy of hidden variables with number of iterations (soft-movMF).
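The quantity tracked in Figure 6.8 can be illustrated with a minimal sketch. It assumes the entropy is averaged over documents and measured in bits (base-2 logarithm); both choices are assumptions, as is the hypothetical array name `posteriors`, which stands for an (n_docs, n_clusters) matrix of soft cluster memberships produced by one EM iteration.

```python
import numpy as np

def mean_posterior_entropy(posteriors, base=2.0):
    """Average entropy of the rows of an (n_docs, n_clusters) posterior matrix.

    Assumption: each row is a probability distribution over clusters
    (non-negative entries summing to 1). The base-2 default is also an assumption.
    """
    p = np.asarray(posteriors, dtype=float)
    # Replace zeros by 1 inside the log so that 0 * log(0) contributes 0.
    safe = np.where(p > 0, p, 1.0)
    per_doc = -(p * np.log(safe)).sum(axis=1) / np.log(base)
    return float(per_doc.mean())

if __name__ == "__main__":
    uniform = np.full((4, 3), 1.0 / 3.0)  # maximally uncertain assignments
    hard = np.eye(3)                      # fully confident assignments
    print(mean_posterior_entropy(uniform))
    print(mean_posterior_entropy(hard))
```

For three equally likely clusters the function returns log2(3), about 1.585 bits, and it returns 0 once every document is assigned to a single cluster with certainty, so a falling curve indicates posteriors that become more peaked as the iterations proceed.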

tive criterion for evaluation and model selection for clustering algorithms was proposed in (8): how well does the clustering algorithm perform as a prediction algorithm? The prediction accuracy of the clustering is measured by the PAC-MDL bound (13; 8), which upper-bounds the error rate of predictions on the test set. Using it for model selection is straightforward: among a range of candidate numbers of clusters, choose the one that achieves the minimum bound on the test-set error rate. Experiments on model selection applied to several clustering algorithms were reported in (8). Interestingly, the movMF-based algorithms almost always obtained the "right number of clusters"; in this case, the underlying labels in the dataset were known, and the number of labels was taken to be the right number of clusters. Note that this form of model selection works only in a semi-supervised setting, where a small amount of labeled data is available for model selection.
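The selection rule described above can be sketched as a simple minimization. This is only an illustration: `select_num_clusters` and `bound_for_k` are hypothetical names, and `toy_bound` is a made-up placeholder, not the PAC-MDL bound of (13; 8), which would have to be computed from a clustering and the labeled data.

```python
def select_num_clusters(candidate_ks, bound_for_k):
    """Return (best_k, bounds), where best_k minimizes bound_for_k(k).

    bound_for_k is a hypothetical callable that maps a number of clusters
    to an upper bound on the test-set error rate.
    """
    bounds = {k: bound_for_k(k) for k in candidate_ks}
    best_k = min(bounds, key=bounds.get)
    return best_k, bounds

def toy_bound(k):
    # Placeholder only: an arbitrary curve with its minimum at k = 3.
    return 0.1 + 0.05 * abs(k - 3)

if __name__ == "__main__":
    best_k, bounds = select_num_clusters(range(2, 8), toy_bound)
    print(best_k, bounds)
```

In practice `bound_for_k` would cluster the data with k components, use the small labeled set to turn clusters into predictions, and evaluate the bound on the resulting predictor.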

6.9 Conclusions and Future Work

From the experimental results, it seems that high-dimensional text data have properties that match well with the modeling assumptions of the vMF mixture model. This motivates further study of such models. For example,
