Information Technology Reference
In-Depth Information
R (
R (
D a )
·
D b )
s ab =
| R (
|·|R (
D a )
D b )
|
R ( D i )isthe i -th row of the matrix
M (
where
D
), namely, the normalized
signature vector of the document D i .
Complexity: O ( n 2 m 2 ).
3.6 Clustering
The final stage is to use Quasi-Clique Merge algorithm ( QCM [14]) to cluster
all documents. We suppose h is the level number of the hierarchical system.
Then, by the estimation in [14], the number of iterations is bounded by
O ( hn 2 log ( n )). Note that, for an input set of n documents, the number of
hierarchical levels is log( n ) in average. Thus, the complexity of QCM is
O ([ nlog ( n )] 2 ).
3.7 Total Complexity
By summing up all steps, the total complexity is
O ( t + m 2 φ 2 + nm 2 + n 2 m 2 +[ nlog ( n )] 2 ),
where
is the number of the distinct alphabets appearing in the key words,
t is the average length of the documents, φ is the average appearance of a
keyword in a document, m is the total number of keywords, n is the total
number of documents.
Since we compare lots of documents, so φ (the average appearance of
a keyword in a document) is much smaller than n (the total number of
documents), and t is usually less than n 2 m 2 . Thus, the complexity is further
simplified as O ( n 2 m 2 +[ nlog ( n )] 2 ).
|
Σ
|
4 Experimental Results
In order to evaluate the effectiveness of our algorithm, we will compare the
results of our method with the usual keyword frequency method. We calculate
the similarity between every pair of documents the following two different
ways:
KF method: only use keyword frequency information.
KFP method: use keyword frequency and structure information, which is
based on the weighted directed multigraph model described in this paper.
In the following analysis we will show that the KFP method is superior to
the KF method.
 
Search WWH ::




Custom Search