[Figure: two panels plotting MI values (y-axis) against the number of
clusters, k (x-axis, 2 to 11), for fskmeans, spkmeans, hard-movMF, and
soft-movMF. Panel titles: "MI values on small-news20-sim3" and
"MI values on news20-sim3".]
(a) MI values for Same-100.
(b) MI values for Same-1000.
FIGURE 6.7: Comparison of the algorithms for more subsets of 20 Newsgroup
data.
6.7.7 Slashdot Datasets
The Slashdot dataset was created to test the performance of the moVMF
model on a typical web application. To gain a better understanding of the
relative performance of the model compared to other state-of-the-art models
for text clustering and topic modeling, moVMF was compared with latent
Dirichlet allocation (LDA) (12) and the exponential family approximation
of the Dirichlet compounded multinomial (EDCM) model (23). Table 6.8
shows the comparative performance in terms of cluster quality measured by
normalized mutual information (NMI), and in terms of running time. Overall,
moVMF gives better clustering results on both datasets, while its running
time is several times lower than that of the other algorithms (roughly 3 to
6 times in Table 6.8). Similar results on other benchmark datasets have been
reported in (4).
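As a rough illustration of the cluster-quality measure used here, the following sketch computes NMI between a reference labeling and a clustering. The text does not say which normalization is used; dividing by the geometric mean of the two entropies is an assumption made for this example, and the function names are illustrative only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two labelings of the same items."""
    n = len(a)
    joint = Counter(zip(a, b))      # co-occurrence counts
    count_a, count_b = Counter(a), Counter(b)
    mi = 0.0
    for (x, y), n_xy in joint.items():
        mi += (n_xy / n) * math.log(n_xy * n / (count_a[x] * count_b[y]))
    return mi


def nmi(a, b):
    """NMI normalized by sqrt(H(a) * H(b)); this normalization is an assumption."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 or h_b == 0.0:
        return 0.0                  # a single-cluster labeling carries no information
    return mutual_information(a, b) / math.sqrt(h_a * h_b)
```

With this definition, a clustering that matches the reference labels up to a renaming of clusters scores 1, and one that is statistically independent of the reference scores 0.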
TABLE 6.8: Performance comparison of algorithms averaged over 5 runs.

                   NMI                Run Time (sec)
Dataset    moVMF   EDCM   LDA     moVMF   EDCM   LDA
slash-7    0.39    0.22   0.31    15      40     47
slash-6    0.65    0.36   0.46    6       26     36
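The running-time comparison in Table 6.8 can be checked with a little arithmetic. The sketch below takes the run times from the table and computes how many times longer each baseline runs than moVMF; the dictionary layout and the function name are illustrative only.

```python
# Run times in seconds, copied from Table 6.8 (averaged over 5 runs).
RUN_TIME = {
    "slash-7": {"moVMF": 15, "EDCM": 40, "LDA": 47},
    "slash-6": {"moVMF": 6, "EDCM": 26, "LDA": 36},
}


def slowdown_vs_movmf(run_time):
    """For each dataset, return baseline run time divided by moVMF run time."""
    ratios = {}
    for dataset, times in run_time.items():
        base = times["moVMF"]
        ratios[dataset] = {
            name: t / base for name, t in times.items() if name != "moVMF"
        }
    return ratios


if __name__ == "__main__":
    for dataset, row in slowdown_vs_movmf(RUN_TIME).items():
        for name, ratio in row.items():
            print(f"{dataset}: {name} takes {ratio:.1f}x the moVMF run time")
```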
Table 6.9 shows the qualitative performance of the moVMF model on the
Slash-7 dataset in terms of the top keywords associated with five of the
clusters. The “topics” associated with each cluster are of comparable
quality to that
 