Text Clustering with Mixture of von Mises-Fisher Distributions - Text Mining: Classification, Clustering, and Applications - page 145


TABLE 6.3: Performance of soft-moVMF on big-mix dataset.

min μᵀμ̂   avg μᵀμ̂   max |κ − κ̂|/|κ|   avg |κ − κ̂|/|κ|   max |α − α̂|/|α|   avg |α − α̂|/|α|
0.994      0.998      0.006              0.004              0.002              0.001
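The columns of Table 6.3 can be read as comparisons between the true mixture parameters (mean directions μ, concentrations κ, mixing weights α) and their estimates. The following is a minimal Python sketch of how such summaries might be computed, assuming each estimated component has already been matched to its true counterpart and that all mean directions are unit vectors. The function name `recovery_metrics` and the toy numbers are hypothetical illustrations, not values from the big-mix experiment.

```python
from statistics import mean


def recovery_metrics(mu, mu_hat, kappa, kappa_hat, alpha, alpha_hat):
    """Summarize agreement between true and estimated mixture parameters.

    Assumption: component i of the estimate corresponds to component i of
    the truth, and every mean direction is a unit-length vector.
    """
    # Inner product of each true mean direction with its estimate
    # (equals the cosine of the angle between them for unit vectors).
    cosines = [sum(a * b for a, b in zip(m, mh)) for m, mh in zip(mu, mu_hat)]
    # Relative absolute errors of concentrations and mixing weights.
    rel_kappa = [abs(k - kh) / abs(k) for k, kh in zip(kappa, kappa_hat)]
    rel_alpha = [abs(a - ah) / abs(a) for a, ah in zip(alpha, alpha_hat)]
    return {
        "min_cos": min(cosines),
        "avg_cos": mean(cosines),
        "max_rel_kappa": max(rel_kappa),
        "avg_rel_kappa": mean(rel_kappa),
        "max_rel_alpha": max(rel_alpha),
        "avg_rel_alpha": mean(rel_alpha),
    }


# Hypothetical two-component example in two dimensions.
m = recovery_metrics(
    mu=[(1.0, 0.0), (0.0, 1.0)],
    mu_hat=[(1.0, 0.0), (0.6, 0.8)],
    kappa=[10.0, 20.0],
    kappa_hat=[11.0, 19.0],
    alpha=[0.5, 0.5],
    alpha_hat=[0.4, 0.6],
)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Values of the inner product close to 1 and relative errors close to 0, as in Table 6.3, indicate that the estimated parameters nearly coincide with the true ones.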

6.7.4 Classic3 Family of Datasets

Table 6.4 shows typical confusion matrices obtained for the full Classic3
dataset. All four algorithms perform very similarly, and the general moVMF
model yields no added advantage over the other algorithms. This observation
can be explained by noting that the clusters of Classic3 are well separated
and each contains a sufficient number of documents. For this clustering,
hard-moVMF yielded κ values of (732.13, 809.53, 1000.04), while soft-moVMF
reported κ values of (731.55, 808.21, 1002.95).

TABLE 6.4: Comparative confusion matrices for 3 clusters of Classic3
(rows represent clusters).

     fskmeans            spkmeans           hard-moVMF          soft-moVMF
 med  cisi  cran     med  cisi  cran     med  cisi  cran     med  cisi  cran
1019     0     0    1019     0     0    1018     0     0    1019     0     1
   1     6  1386       1     6  1386       2     6  1387       1     4  1384
  13  1454    12      13  1454    12      13  1454    11      13  1456    13
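One way to put a number on how similar the four confusion matrices in Table 6.4 are is to compute cluster purity, the fraction of documents that fall in the majority class of their cluster. Purity is used here only as an illustrative summary and is not a measure named in the text above. The sketch below copies the counts from Table 6.4; the helper name `purity` is hypothetical.

```python
# Confusion matrices copied from Table 6.4.
# Rows are clusters; columns are the true classes (med, cisi, cran).
CONFUSION = {
    "fskmeans":   [[1019, 0, 0], [1, 6, 1386], [13, 1454, 12]],
    "spkmeans":   [[1019, 0, 0], [1, 6, 1386], [13, 1454, 12]],
    "hard-moVMF": [[1018, 0, 0], [2, 6, 1387], [13, 1454, 11]],
    "soft-moVMF": [[1019, 0, 1], [1, 4, 1384], [13, 1456, 13]],
}


def purity(matrix):
    """Fraction of documents lying in the majority class of their cluster."""
    total = sum(sum(row) for row in matrix)
    majority = sum(max(row) for row in matrix)
    return majority / total


for name, matrix in CONFUSION.items():
    print(f"{name:>10}: purity = {purity(matrix):.4f}")
```

For each of the four matrices the majority counts sum to 3859 out of 3891 documents, so the purity is about 0.9918 in every case, which is consistent with the observation that the algorithms perform almost identically on the full Classic3 dataset.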

Table 6.5 shows the confusion matrices obtained for the Classic300 dataset.
Even though Classic300 is well separated, the small number of documents per
cluster makes the problem somewhat difficult for fskmeans and spkmeans,
while hard-moVMF performs much better due to its model flexibility.
The soft-moVMF algorithm performs appreciably better than the other three
algorithms.

It seems that the low number of documents does not pose a problem for
soft-moVMF, which obtains an almost perfect clustering for this dataset.
Thus in this case, despite the low number of points per cluster, the
superior modeling power of our moVMF-based algorithms prevents them from
getting trapped in inferior local minima as compared to the other
algorithms, resulting in a better clustering.

The confusion matrices obtained for the Classic400 dataset are displayed in
Table 6.6. The behavior of the algorithms for this dataset is quite interesting.
As before, due to the small number of documents per cluster, fskmeans and
spkmeans give a rather mixed confusion matrix. The hard-moVMF algorithm
assigns a significant part of the bigger cluster correctly and achieves some
amount of separation between the two smaller clusters. The soft-moVMF
algorithm exhibits somewhat intriguing behavior. It splits the bigger cluster into two,
