Fig. 7. Execution time of different algorithms on randomly generated networks
in comparison with other algorithms is tremendous and HSNPI outperforms GA when
the cost of generated multicast trees is considered.
7 Document Clustering
Fast, high-quality document clustering has become an increasingly important
technique for enhancing search engine results, web crawling, unsupervised document
organization, and information retrieval or filtering. Clustering involves dividing a set
of documents into a specified number of groups. The documents within each group
should exhibit a large degree of similarity, while the similarity between different
clusters should be minimized. Some of the more familiar clustering methods are
partitioning algorithms, which divide the entire data set into dissimilar groups;
hierarchical methods; density- and grid-based clustering; and some graph-based
methods [34, 35].
In most document clustering algorithms, documents are represented using a
vector-space model. In this model, each document d is considered to be a vector
$\vec{d} = \{d_1, d_2, \ldots, d_t\}$ in term-space (set of document 'words'), where
$d_i$ is the weight of dimension i in vector space and t is the number of term
dimensions. The most widely used weighting approach for term weights is the
combination of Term Frequency and Inverse Document Frequency (TF-IDF) [36-38].
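As a rough illustration of the vector-space representation, the sketch below builds TF-IDF vectors for a tiny corpus. The text does not say which TF-IDF variant is meant, so the formula here (raw term count multiplied by log(N / df)) is an assumption, as are the function name `tfidf_vectors`, the whitespace tokenizer, and the sample documents.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors): one weight per term dimension per document.

    Assumed variant: weight = raw term count * log(N / document frequency).
    Other TF-IDF variants (smoothing, normalization) are equally common.
    """
    # Hypothetical tokenizer: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    # The term-space: every distinct term, in a fixed (sorted) order.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # missing terms count as 0
        vectors.append(
            [tf[term] * math.log(n_docs / df[term]) for term in vocabulary]
        )
    return vocabulary, vectors


# Illustrative corpus (not from the source text).
docs = ["the cat sat", "the dog sat", "the dog barked"]
vocabulary, vectors = tfidf_vectors(docs)
```

Under this variant a term that occurs in every document (here "the") receives weight 0, since log(N / N) = 0, while rarer terms receive larger weights.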
The similarity between two documents must be measured in some way if a clustering
algorithm is to be used. The vector space model gives us a good opportunity for
defining different metrics for similarity between two documents. The most common
similarity metrics are Minkowski distances [39] and the cosine measure [36, 38, 40].
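The two metrics named above can be sketched for plain lists of term weights as follows. This is a minimal sketch using the standard textbook definitions. The function names, the treatment of all-zero vectors, and the sample vectors are assumptions made for illustration.

```python
import math


def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p (p=1: Manhattan, p=2: Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def cosine_similarity(a, b):
    """Cosine of the angle between two document vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumed convention: an all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Illustrative weight vectors (not from the source text).
d1 = [1.0, 0.0, 2.0]
d2 = [2.0, 0.0, 4.0]  # same direction as d1, twice the length
d3 = [0.0, 3.0, 0.0]  # shares no terms with d1
```

Note that the cosine measure depends only on the direction of the vectors, so d1 and d2 count as maximally similar even though their Minkowski distance is non-zero.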