the documents and the edge set $E = \{e(i,j)\}$ is the Euclidean distance between
the document features. The "cut" is defined as the sum of edges between two
clusters, defined as

$$\mathrm{cut}(A,B) = \sum_{i \in A,\, j \in B} e(i,j) \tag{8}$$
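As a minimal sketch of Eq. (8), assuming the document features are stored as rows of a NumPy array, the edge weights and the cut can be computed as follows. The helper names `edge_weights` and `cut` and the toy feature values are hypothetical and chosen only for illustration.

```python
import numpy as np

def edge_weights(X):
    """Pairwise Euclidean distances e(i, j) between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def cut(E, A, B):
    """Eq. (8): sum of e(i, j) over all i in A and j in B."""
    return E[np.ix_(A, B)].sum()

# Toy document features (assumed values, not data from the experiments).
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
E = edge_weights(X)
print(cut(E, [0], [1, 2]))  # distances 5 and 10 sum to 15.0
```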
The criteria of Average Association and Normalized Cut are formulated as

$$AA = \frac{\mathrm{cut}(A,A)}{|A|} + \frac{\mathrm{cut}(B,B)}{|B|} \tag{9}$$

$$NC = \frac{\mathrm{cut}(A,B)}{\mathrm{cut}(A,V)} + \frac{\mathrm{cut}(A,B)}{\mathrm{cut}(B,V)} \tag{10}$$
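The two criteria can be evaluated directly from an edge-weight matrix. The following is a sketch under stated assumptions: the function names and the small symmetric weight matrix are hypothetical, and V is taken to be the union of the two clusters A and B.

```python
import numpy as np

def cut(E, A, B):
    """Sum of e(i, j) over all i in A and j in B."""
    return E[np.ix_(A, B)].sum()

def average_association(E, A, B):
    """Eq. (9): within-cluster weight normalized by cluster size."""
    return cut(E, A, A) / len(A) + cut(E, B, B) / len(B)

def normalized_cut(E, A, B):
    """Eq. (10): between-cluster weight normalized by each cluster's total weight."""
    V = list(A) + list(B)  # assumption: A and B partition the vertex set V
    c = cut(E, A, B)
    return c / cut(E, A, V) + c / cut(E, B, V)

# Toy symmetric edge weights (assumed values for illustration only).
E = np.array([[0, 2, 1, 0],
              [2, 0, 0, 1],
              [1, 0, 0, 3],
              [0, 1, 3, 0]], dtype=float)
A, B = [0, 1], [2, 3]
print(average_association(E, A, B))  # 4/2 + 6/2 = 5.0
print(normalized_cut(E, A, B))       # 2/6 + 2/8
```

Note that cut(A, A) sums over ordered pairs, so each within-cluster edge of an undirected graph is counted twice, exactly as the formula is written.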
In the earlier work on Average Association, the author proved that AA is
equivalent to LSA [1] combined with the K-means clustering method [10] with
respect to their criterion functions. Solving the "cut" problem exactly is
NP-hard, so Shi and Malik proposed an approximation method, the
eigenvector-based criterion, which minimizes the normalized cut efficiently.
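The eigenvector-based approximation can be illustrated with a short sketch of the usual spectral relaxation for a two-way split. It assumes a symmetric, non-negative similarity matrix W with positive row sums (how the distance-based edges above would be converted to similarities is not specified here), and it thresholds the second-smallest eigenvector at zero, which is only one simple choice of splitting point. The function name `ncut_bipartition` and the toy matrix are hypothetical; this is not the implementation used in the experiments.

```python
import numpy as np

def ncut_bipartition(W):
    """Approximate two-way normalized cut of a symmetric similarity matrix W."""
    d = W.sum(axis=1)                      # vertex degrees
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L_sym)     # eigenvalues in ascending order
    # Map the second-smallest eigenvector back to the generalized problem
    # (D - W) y = lambda D y, then split by sign.
    y = d_inv_sqrt * eigvecs[:, 1]
    return y > 0

# Toy similarities: two groups of three, strong within and weak between (assumed).
W = np.full((6, 6), 0.05)
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)
labels = ncut_bipartition(W)
print(labels)  # the first three and last three vertices receive opposite labels
```

The sign of an eigenvector is arbitrary, so which group is labeled True may differ between runs or platforms; only the partition itself is meaningful.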
We also implement NMF [4] and NNSC [5] for comparison. The setup of these
two methods is the same as that of NMSC. There are 50 testing rounds in total,
and the final accuracy is the average of the accuracies over all rounds.
Fig. 3. Comparison of performance. The accuracy of each clustering method for
different cluster numbers k.
Figure 3 shows that, compared with the other methods, including NNSC, NMSC
performs best in the task of document clustering. In fact, NNSC is just one
layer of NMSC, whose bases are assumed to be equally weighted for data
representation, while NMSC can more precisely discriminate the importance of
each basis by discovering the hierarchical structure.