The four algorithms perform similarly at the true number of clusters. However, as the number of clusters increases, soft-moVMF seems to outperform the others by a significant margin.
For Classic300 (Figure 6.5(b)) and Classic400 (Figure 6.5(c)), soft-moVMF significantly outperforms the other algorithms; in fact, on these two datasets it performs substantially better than the other three even at the correct number of clusters. Among the other three, hard-moVMF seems to perform better than spkmeans and fskmeans across the range of cluster counts.
6.7.5 Yahoo News Dataset
The Yahoo News dataset is relatively difficult to cluster, since it has a fair amount of overlap among its clusters and the number of points per cluster is low. In addition, the clusters are highly skewed in their relative sizes.
Results for the different algorithms can be seen in Figure 6.5(d). Over the entire range, soft-moVMF consistently performs better than the other algorithms. Even at the correct number of clusters, k = 20, it performs significantly better than the others.
6.7.6 20 Newsgroup Family of Datasets
Now we discuss the clustering performance of the four algorithms on the 20 Newsgroup datasets. Figure 6.6(a) shows the MI plots for the full News20 dataset. All the algorithms perform similarly until the true number of clusters, after which soft-moVMF and spkmeans perform better than the others. Otherwise, we do not notice any particularly interesting differences among the four algorithms in this figure.
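The MI values in these plots measure the dependence between the cluster assignments and the true class labels. As a point of reference, the following is a minimal sketch of how such a mutual information score can be computed from two label vectors; it is our own illustration (assuming NumPy, with scikit-learn's mutual_info_score as an optional cross-check) and not the chapter's evaluation code.

    # Minimal sketch: mutual information between true labels and cluster assignments.
    # Illustrative only; the exact MI variant/normalization behind the figures is
    # not reproduced here.
    import numpy as np

    def mutual_information(labels_true, labels_pred):
        """MI(T; C) = sum over t, c of p(t, c) * log(p(t, c) / (p(t) * p(c))), in nats."""
        t = np.asarray(labels_true)
        c = np.asarray(labels_pred)
        mi = 0.0
        for tv in np.unique(t):
            p_t = np.mean(t == tv)
            for cv in np.unique(c):
                p_c = np.mean(c == cv)
                p_tc = np.mean((t == tv) & (c == cv))
                if p_tc > 0.0:
                    mi += p_tc * np.log(p_tc / (p_t * p_c))
        return mi

    # Optional cross-check (assuming scikit-learn is available):
    # from sklearn.metrics import mutual_info_score
    # mutual_info_score(labels_true, labels_pred) should agree with the value above.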
Figure 6.6(b) shows the MI plots for the Small-News20 dataset, and here the results are quite different. Since the number of documents per cluster is small (100), spkmeans and fskmeans again do not perform that well, even at the true number of clusters, whereas soft-moVMF performs considerably better than the others over the entire range. Again, hard-moVMF exhibits good MI values until the true number of clusters, after which its performance falls sharply.
the other hand, for the datasets that have a reasonably large number of doc-
uments per cluster, another kind of behavior is usually observed. All the
algorithms perform quite similarly until the true number of clusters, after
which soft-moVMF performs significantly better than the other three. This
behavior can be observed in Figures 6.6(d), 6.6(f), and 6.7(b). We note that
the other three algorithms perform quite similarly over the entire range of
clusters. We also observe that for an easy dataset like Different-1000, the MI
values peak at the true number of clusters, whereas for a more dicult dataset
such as Similar-1000 the MI values increase as the clusters get further refined.
This behavior is expected since the clusters in Similar-1000 have much greater
overlap than those in Different-1000.
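To make the experimental setup concrete, the sketch below runs the kind of sweep over the number of clusters that underlies these MI plots, using the 20 Newsgroups data. It is a rough stand-in under stated assumptions: tf-idf features, L2-normalized document vectors, and k-means on the normalized vectors in place of spkmeans; it is not the chapter's implementation of spkmeans, fskmeans, hard-moVMF, or soft-moVMF.

    # Hypothetical sketch of an MI-versus-number-of-clusters sweep on 20 Newsgroups.
    # Spherical k-means is approximated here by k-means on L2-normalized tf-idf
    # vectors; this is only a stand-in for the algorithms compared in the text.
    from sklearn.cluster import KMeans
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import mutual_info_score
    from sklearn.preprocessing import normalize

    news = fetch_20newsgroups(subset="train",
                              remove=("headers", "footers", "quotes"))
    X = TfidfVectorizer(stop_words="english", max_features=20000).fit_transform(news.data)
    X = normalize(X)  # unit-length document vectors, as in directional clustering

    for k in range(5, 31, 5):  # sweep around the true number of clusters (20)
        pred = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(f"k = {k:2d}   MI = {mutual_info_score(news.target, pred):.3f}")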