TABLE 6.3: Performance of soft-moVMF on big-mix dataset.

min μ^T μ̂   avg μ^T μ̂   max |κ−κ̂|/|κ|   avg |κ−κ̂|/|κ|   max |α−α̂|/|α|   avg |α−α̂|/|α|
   0.994        0.998         0.006            0.004            0.002            0.001
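The columns of Table 6.3 can be read as parameter-recovery scores: the cosine between each true mean direction and its estimate, and the relative errors of the concentration κ and the mixing weight α. The sketch below shows one way such scores might be computed. The function name, the dictionary keys, and the assumption that component h of the estimates has already been matched to component h of the true model are illustrative choices, not taken from the text.

```python
import math


def _unit(v):
    """Scale a vector to unit length so the dot product is a cosine."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _relative_errors(true_vals, est_vals):
    """Per-component relative error |t - e| / |t|."""
    return [abs(t - e) / abs(t) for t, e in zip(true_vals, est_vals)]


def recovery_metrics(mu_true, mu_est, kappa_true, kappa_est,
                     alpha_true, alpha_est):
    """Summarise how closely estimated mixture parameters match the true ones.

    Assumes (hypothetically) that the components are already aligned, i.e.
    entry h of every estimate corresponds to entry h of the true parameters.
    """
    cosines = [
        sum(a * b for a, b in zip(_unit(t), _unit(e)))
        for t, e in zip(mu_true, mu_est)
    ]
    kappa_err = _relative_errors(kappa_true, kappa_est)
    alpha_err = _relative_errors(alpha_true, alpha_est)
    return {
        "min_cos": min(cosines),
        "avg_cos": sum(cosines) / len(cosines),
        "max_kappa_err": max(kappa_err),
        "avg_kappa_err": sum(kappa_err) / len(kappa_err),
        "max_alpha_err": max(alpha_err),
        "avg_alpha_err": sum(alpha_err) / len(alpha_err),
    }
```

Under this reading, cosines near 1 and relative errors near 0, as in Table 6.3, indicate that the estimated parameters are very close to the true ones.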
6.7.4 Classic3 Family of Datasets
Table 6.4 shows typical confusion matrices obtained for the full Classic3
dataset. The performance of all four algorithms is quite similar, and the
general moVMF model yields no added advantage over the other algorithms.
This observation can be explained by noting that the clusters of Classic3
are well separated and each has a sufficient number of documents. For this
clustering, hard-moVMF yielded κ values of (732.13, 809.53, 1000.04), while
soft-moVMF reported κ values of (731.55, 808.21, 1002.95).
TABLE 6.4: Comparative confusion matrices for 3 clusters of Classic3
(rows represent clusters).

     fskmeans      |     spkmeans      |    hard-moVMF     |    soft-moVMF
   med  cisi  cran |   med  cisi  cran |   med  cisi  cran |   med  cisi  cran
  1019     0     0 |  1019     0     0 |  1018     0     0 |  1019     0     1
     1     6  1386 |     1     6  1386 |     2     6  1387 |     1     4  1384
    13  1454    12 |    13  1454    12 |    13  1454    11 |    13  1456    13
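To put a single number on how similar the four results in Table 6.4 are, one option is cluster purity: the fraction of documents that fall in the dominant class of their cluster. Purity is used here only as an illustration and is not necessarily the measure the authors report. The matrices are copied from Table 6.4, while the names `TABLE_6_4` and `purity` are hypothetical.

```python
# Confusion matrices from Table 6.4.
# Rows are clusters; columns are the true classes (med, cisi, cran).
TABLE_6_4 = {
    "fskmeans":   [[1019, 0, 0], [1, 6, 1386], [13, 1454, 12]],
    "spkmeans":   [[1019, 0, 0], [1, 6, 1386], [13, 1454, 12]],
    "hard-moVMF": [[1018, 0, 0], [2, 6, 1387], [13, 1454, 11]],
    "soft-moVMF": [[1019, 0, 1], [1, 4, 1384], [13, 1456, 13]],
}


def purity(confusion):
    """Fraction of documents lying in the majority class of their cluster."""
    dominant = sum(max(row) for row in confusion)
    total = sum(sum(row) for row in confusion)
    return dominant / total


if __name__ == "__main__":
    for name, matrix in TABLE_6_4.items():
        print(f"{name:>10s}: purity = {purity(matrix):.4f}")
```

For each of the four matrices, 3859 of the 3891 documents fall in the dominant class of their cluster, giving a purity of about 0.992, which agrees with the observation that the algorithms perform almost identically on Classic3.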
Table 6.5 shows the confusion matrices obtained for the Classic300 dataset.
Even though Classic300 is well separated, the small number of documents per
cluster makes the problem somewhat difficult for fskmeans and spkmeans,
while hard-moVMF performs much better owing to its model flexibility.
The soft-moVMF algorithm performs appreciably better than the other three
algorithms.
The low number of documents does not seem to pose a problem for
soft-moVMF, which obtains an almost perfect clustering for this dataset.
Thus in this case, despite the low number of points per cluster, the
superior modeling power of our moVMF-based algorithms prevents them from
getting trapped in inferior local minima, unlike the other algorithms,
resulting in a better clustering.
The confusion matrices obtained for the Classic400 dataset are displayed in
Table 6.6. The behavior of the algorithms for this dataset is quite interesting.
As before, due to the small number of documents per cluster, fskmeans and
spkmeans give a rather mixed confusion matrix. The hard-moVMF algorithm
recovers a significant part of the bigger cluster correctly and achieves some
separation between the two smaller clusters. The soft-moVMF algorithm
exhibits a somewhat intriguing behavior. It splits the bigger cluster into two,
 