document clustering. Particle swarm optimization (PSO) [44] is another computational
intelligence method that has been applied to image clustering and other
low-dimensional datasets in [39, 45, 46] and to document clustering in [42]. HS is
employed for document clustering in [47, 48].
To compare the quality and speed of different clustering algorithms, several
well-known data sets are available and have been used. In all of the datasets, before
a clustering algorithm is applied, very common words (stop words) are stripped out
completely, different forms of a word are reduced to one canonical form using
Porter's algorithm, and the documents are then converted to the vector space model.
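The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the stop-word list is a tiny hypothetical sample, `crude_stem` is a simplified suffix stripper standing in for Porter's algorithm (it is not an implementation of it), and raw term-frequency weighting is an assumption, since the text only says that documents are converted to the vector space model.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "by", "in", "is", "of", "the", "to"}


def crude_stem(word):
    """Strip a few common suffixes.

    Stand-in for Porter's algorithm, NOT an implementation of it.
    """
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters of the stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stop words, and reduce words to a stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def to_vectors(documents):
    """Map each document to a term-frequency vector over a shared vocabulary."""
    token_lists = [preprocess(d) for d in documents]
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


docs = ["The clusters are clustering documents", "Clustering of the document"]
vocabulary, vectors = to_vectors(docs)
print(vocabulary)  # ['cluster', 'document']
print(vectors)     # [[2, 1], [1, 1]]
```

In this toy run, "clusters" and "clustering" collapse to the same term, which is the effect the canonical-form step is meant to achieve.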
To demonstrate the document clustering accuracy in comparison to the best
contemporary methods, five data sets are selected from different known sources. Data
sets DS1 and DS2 are from TREC-5, TREC-6, and TREC-7 [49]; the data set DS3
was derived from the San Jose Mercury newspaper articles that are distributed as part
of the TREC collection (TIPSTER); the data set DS4 is selected from the DMOZ
collection; and the DS5 dataset is a collection of 10,000 messages, collected from 10
different Usenet newsgroups (1,000 messages from each). After preprocessing, there are
a total of 9,249 documents in DS5.
Figure 8 compares five different algorithms on the selected datasets. These
algorithms include HS clustering [47], K-means (the best-known partitioning algorithm),
genetic K-means (GA) [50], particle swarm optimization based clustering (PSO) [42],
and a von Mises-Fisher generative model based algorithm (GM) [51, 52]. Figure 8 shows
the results of applying these algorithms to the five datasets in terms of the normalized
ADDC of each algorithm. The results show that the HS method outperforms GA,
K-means, and PSO on all datasets, while the GM algorithm generates
higher quality clusters than the HS based algorithm on dataset DS2.
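As a rough illustration of the quality measure, the sketch below computes an ADDC value for a given clustering. It rests on assumptions that this section does not state: ADDC is taken to mean the average distance of documents to their cluster centroid, averaged over clusters; cosine distance is used as the distance measure; and the normalization applied in Figure 8 is not reproduced. The function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity (assumed distance measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def addc(vectors, labels, distance=cosine_distance):
    """Mean over clusters of the mean document-to-centroid distance.

    Assumed definition of ADDC; not taken from this section.
    """
    clusters = {}
    for vector, label in zip(vectors, labels):
        clusters.setdefault(label, []).append(vector)
    per_cluster = []
    for members in clusters.values():
        center = centroid(members)
        total = sum(distance(center, member) for member in members)
        per_cluster.append(total / len(members))
    return sum(per_cluster) / len(per_cluster)


vectors = [[1, 0], [1, 0], [0, 1], [0, 1]]
print(addc(vectors, [0, 0, 1, 1]))  # 0.0, identical documents grouped together
print(addc(vectors, [0, 1, 0, 1]))  # about 0.293, dissimilar documents mixed
```

Under this assumed definition, smaller values correspond to more compact clusters.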
[Figure 8. Legend: K-means, Harmony, GA, PSO, GM. Horizontal axis: Dataset (DS1 to DS5). Vertical axis scale: 0 to 1.]
Fig. 8. Quality of clustering generated by various algorithms