Let $R(D_i)$ be the signature vector of the $i$-th document.
Let $M = [R(D_1), R(D_2), \ldots, R(D_{n-1}), R(D_n)]^T$; then $M$ is an $n \times (m + m^2)$ matrix, where $n$ is the total number of documents and $m$ is the cardinality of the keyword set $K$. Each row of the matrix represents a document.
2.3.2 Normalization of the Matrix
We normalize the matrix $M$ column by column so that every dimension is on a comparable scale. We denote the normalized matrix as
$\tilde{M} = [\tilde{R}(D_1), \tilde{R}(D_2), \ldots, \tilde{R}(D_{n-1}), \tilde{R}(D_n)]^T$. The details of the
normalization are presented in the next section.
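As a rough illustration of what column-wise normalization can look like, here is a minimal sketch. The scaling rule (dividing each column by its maximum absolute value) is an assumption chosen only for illustration; the actual rule is the one given in the next section. The function name `normalize_columns` is hypothetical.

```python
def normalize_columns(M):
    """Scale each column of M (a list of equal-length rows).

    ASSUMPTION: each column is divided by its maximum absolute value.
    This is an illustrative choice, not the scheme defined in the text.
    """
    if not M:
        return []
    n_cols = len(M[0])
    scales = []
    for j in range(n_cols):
        col_max = max(abs(row[j]) for row in M)
        # Leave all-zero columns unchanged to avoid division by zero.
        scales.append(col_max if col_max > 0 else 1.0)
    return [[row[j] / scales[j] for j in range(n_cols)] for row in M]
```

Each row of the result plays the role of a normalized signature vector, with all columns mapped into a common range.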
2.3.3 Similarity
The similarity $S_{ab}$ between any two documents $D_a$, $D_b$ is determined by the
cosine similarity as follows:
$$S_{ab} = \frac{|\tilde{R}(D_a) \cdot \tilde{R}(D_b)|}{|\tilde{R}(D_a)| \cdot |\tilde{R}(D_b)|}$$
where $\tilde{R}(D_a)$, $\tilde{R}(D_b)$ are the normalized signature vectors of the documents
$D_a$, $D_b$.
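The similarity formula can be sketched in a few lines of plain Python. The function names `cosine_similarity` and `similarity_matrix` are hypothetical, and returning 0 for an all-zero vector is an assumption, since the text does not say how that case is handled.

```python
import math

def cosine_similarity(u, v):
    """Return |u . v| / (|u| * |v|) for two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # ASSUMPTION: an all-zero vector has similarity 0
    return abs(dot) / (norm_u * norm_v)

def similarity_matrix(rows):
    """Pairwise similarities S[a][b] for the rows of a normalized matrix."""
    n = len(rows)
    S = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a, n):
            # The measure is symmetric, so fill both halves at once.
            S[a][b] = S[b][a] = cosine_similarity(rows[a], rows[b])
    return S
```

Calling `similarity_matrix` on the rows of the normalized matrix yields the pairwise similarities that the clustering step takes as input.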
2.4 Details of Step 3
A variety of clustering algorithms have been developed and implemented
in popular statistical software packages. General reviews of cluster
analysis can be found in many references, for instance [3, 4, 11]. In general, none of
these algorithms is rigorously guaranteed to produce a globally
optimal clustering for non-trivial objective functions [23].
After calculating the pairwise similarities of all documents, we classify
the documents into groups by applying the Quasi-Clique
Merge (QCM) method. One of the
most significant differences between the QCM method and other clustering
algorithms is that the QCM method constructs a much smaller hierarchical
tree. This tree structure leads to better identification of meaningful clusters,
since there are fewer subdivisions of the data set caused by irrelevant
or improperly interpreted information. Additionally, the QCM method
produces a multi-membership clustering [14], which preserves some of
the ambiguity inherent in the data set rather than erroneously suppressing it, as
many other clustering algorithms do.
 