FIGURE 6.8 (SEE COLOR INSERT FOLLOWING PAGE 130.): Variation of entropy of hidden variables with number of iterations (soft-movMF). [Plot "Entropy over Iterations for soft-movMF": x-axis, number of iterations (0 to 25); y-axis, entropy (0 to 1.8); curves for news20-same3, small-news20-same3, news20-diff3, and small-news20-diff3.]
tive criterion for evaluation and model selection for clustering algorithms was
proposed in (8): how well does the clustering algorithm perform as a prediction
algorithm? The prediction accuracy of the clustering is measured by the PAC-MDL
bound (13; 8), which upper-bounds the error rate of predictions on the test set.
Using it for model selection is straightforward: among a range of candidate
numbers of clusters, choose the one that achieves the minimum bound on the
test-set error rate. Experiments on model selection applied to several
clustering algorithms were reported in (8). Interestingly, the movMF-based
algorithms almost always obtained the 'right number of clusters'; in this case,
the underlying labels in the dataset were actually known, and the number of
labels was considered to be the right number of clusters. It is important to
note that this form of model selection only works in a semi-supervised setting
where a small amount of labeled data is available for model selection.
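The selection rule described above can be sketched in a few lines. This is only an illustration of the "choose the number of clusters with the minimum bound" step, not the procedure from (8): the function name select_num_clusters and the error_bound callable are hypothetical, the PAC-MDL bound itself is not implemented here, and the numbers in the usage section are made up for demonstration.

```python
from typing import Callable, Dict, Iterable, Tuple


def select_num_clusters(
    candidate_ks: Iterable[int],
    error_bound: Callable[[int], float],
) -> Tuple[int, Dict[int, float]]:
    """Return the candidate k with the smallest bound, plus all evaluated bounds.

    `error_bound` is a hypothetical stand-in: it is assumed to cluster the data
    into k groups and return an upper bound on the test-set error rate
    (for instance, a PAC-MDL bound computed elsewhere).
    """
    bounds = {k: error_bound(k) for k in candidate_ks}
    if not bounds:
        raise ValueError("candidate_ks must not be empty")
    # Pick the minimum bound; ties are broken toward the smaller k.
    best_k = min(bounds, key=lambda k: (bounds[k], k))
    return best_k, bounds


if __name__ == "__main__":
    # Made-up bound values for illustration only; not results from (8).
    toy_bounds = {2: 0.41, 3: 0.27, 4: 0.30, 5: 0.36}
    best_k, _ = select_num_clusters(toy_bounds.keys(), toy_bounds.__getitem__)
    print(best_k)
```

Passing the bound as a callable keeps the selection step independent of how the bound is computed, so the same loop could in principle be reused across different clustering algorithms.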
6.9 Conclusions and Future Work
From the experimental results, it seems that high-dimensional text data
have properties that match well with the modeling assumptions of the vMF
mixture model. This motivates further study of such models. For example,
 