3.2 Document Features
The famous weighted term-frequency (tf-idf) [8] vector is used to represent the
document. We collected the vocabulary set $W = \{w_1, w_2, \cdots, w_m\}$ after
removing stop-words. The tf-idf vector of document $d_i$ is defined as
$X_i = [x_{1i}, x_{2i}, \cdots, x_{mi}]^T$, with

$$x_{ji} = t_{ji} \cdot \log\left(\frac{n}{\mathrm{idf}_j}\right) \tag{6}$$

in which $t_{ji}$, $\mathrm{idf}_j$, and $n$ denote the term frequency of the word $w_j$ in
document $d_i$, the number of documents containing $w_j$, and the total number of
documents, respectively. In particular,
words in the title are generally more important than words in the body text.
According to statistics, a word in the title has 5 times the significance of the
same word in the body [9]. Words that appear in both the title and the text are
therefore treated specially when computing word frequency: for such words,
$x_{ji}$ is defined as $x_{ji} = 6\, t_{ji} \cdot \log(n / \mathrm{idf}_j)$.
Moreover, each vector $X_i$ is normalized to unit length. Thus, the $n \times m$
matrix $X$ is the data matrix, and all of its elements are non-negative.
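
The feature construction above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `tfidf_vectors`, the toy documents, and the choice to count title and body occurrences together as the raw term frequency are all assumptions. Only the weighting of Eq. (6), the factor 6 for words found in both title and text, and the unit-length normalization come from the text.

```python
import math
from collections import Counter

def tfidf_vectors(docs, title_factor=6.0):
    """docs: list of (title_tokens, body_tokens), stop-words already removed.

    Returns (vocab, X), where X[i] is the unit-length tf-idf vector of document i.
    """
    # Vocabulary W = {w_1, ..., w_m}, sorted for a reproducible column order.
    vocab = sorted({w for title, body in docs for w in title + body})
    column = {w: j for j, w in enumerate(vocab)}
    n = len(docs)

    # "idf_j" in the text's notation: number of documents containing w_j.
    doc_count = Counter()
    for title, body in docs:
        doc_count.update(set(title) | set(body))

    X = []
    for title, body in docs:
        # Assumption: t_ji is the raw count over title and body together.
        counts = Counter(title + body)
        in_title, in_body = set(title), set(body)
        row = [0.0] * len(vocab)
        for w, t in counts.items():
            # Factor 6 for words present in both title and text, 1 otherwise.
            factor = title_factor if (w in in_title and w in in_body) else 1.0
            row[column[w]] = factor * t * math.log(n / doc_count[w])
        # Normalize to unit Euclidean length (skip all-zero vectors).
        norm = math.sqrt(sum(x * x for x in row))
        if norm > 0:
            row = [x / norm for x in row]
        X.append(row)
    return vocab, X

# Hypothetical toy corpus: (title tokens, body tokens) per document.
docs = [
    (["oil"], ["oil", "price", "rise"]),
    (["trade"], ["price", "talks"]),
    (["wheat"], ["wheat", "crop"]),
]
vocab, X = tfidf_vectors(docs)
```

Because a word can occur in at most $n$ documents, every logarithm is non-negative, so the resulting rows are non-negative as the text requires.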
3.3 Evaluation
The test documents are randomly selected from the data matrix $X$ and mix
documents from several clusters. In each test round, the feature vectors of the
documents from the $k$ selected clusters are processed by the clustering methods.
We evaluate the results against the ground-truth cluster labels provided by the
original Reuters data. The accuracy of a clustering result is defined as the
proportion of documents that are assigned to the same cluster as in the ground
truth:

$$\mathrm{accuracy} = \frac{\sum_{i=1}^{n} \delta(c_i, l_i)}{n} \tag{7}$$

where $c_i$ is the index of the cluster that document $d_i$ belongs to, $l_i$ is
the index of its labeled cluster in the ground truth, and $\delta(c_i, l_i)$
equals 1 if the two indices match and 0 otherwise. We apply the max matching
strategy or the Kuhn-Munkres algorithm as the assignment approach to find the
matching between the resulting clusters and the ground-truth clusters.
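
As a rough illustration of Eq. (7), the sketch below computes the accuracy under the best one-to-one mapping between result clusters and ground-truth labels. Note that it swaps in an exhaustive search over all mappings in place of the Kuhn-Munkres algorithm: both find the same optimal assignment, but the exhaustive version costs factorial time and is only practical for a small number of clusters. The function name `clustering_accuracy` and the example data are hypothetical.

```python
from itertools import permutations

def clustering_accuracy(clusters, labels):
    """Accuracy of Eq. (7) under the best one-to-one cluster-to-label mapping.

    clusters[i] is the cluster index of document i in the result;
    labels[i] is its ground-truth cluster label.
    """
    n = len(labels)
    cluster_ids = sorted(set(clusters))
    label_ids = sorted(set(labels))
    # Pad with None so every result cluster can be mapped to something,
    # even when the result has more clusters than the ground truth.
    if len(label_ids) < len(cluster_ids):
        label_ids = label_ids + [None] * (len(cluster_ids) - len(label_ids))

    best_hits = 0
    # Exhaustive search (stand-in for Kuhn-Munkres): try every injective mapping.
    for perm in permutations(label_ids, len(cluster_ids)):
        mapping = dict(zip(cluster_ids, perm))
        # Sum of delta(c_i, l_i) after mapping cluster indices to labels.
        hits = sum(1 for c, l in zip(clusters, labels) if mapping[c] == l)
        best_hits = max(best_hits, hits)
    return best_hits / n

# Hypothetical example: 8 documents, 3 result clusters, 3 ground-truth labels.
result = [0, 0, 0, 1, 1, 2, 2, 2]
truth = ["b", "b", "a", "a", "a", "c", "c", "b"]
acc = clustering_accuracy(result, truth)
```

Here the best mapping is 0 to "b", 1 to "a", and 2 to "c", which matches 6 of the 8 documents.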
3.4 Implementation and Comparisons
The proposed NMSC is well suited to the task of document clustering because
the document features are non-negative. In this test, we set up the NMSC with
3 layers, and the method automatically removed any cluster containing fewer
than 5 documents.
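
The small-cluster filter can be illustrated as follows. The text does not say what happens to the documents of a removed cluster, so this sketch simply assumes they are discarded; the function name `drop_small_clusters` and the sample assignments are hypothetical, and this is not the NMSC procedure itself.

```python
from collections import Counter

def drop_small_clusters(assignments, min_size=5):
    """Return indices of documents whose cluster has at least min_size members.

    Assumption: documents in smaller clusters are discarded, not reassigned.
    """
    sizes = Counter(assignments)
    return [i for i, c in enumerate(assignments) if sizes[c] >= min_size]

# Hypothetical assignments: cluster 0 has 6 documents, cluster 1 has 3, cluster 2 has 5.
assignments = [0] * 6 + [1] * 3 + [2] * 5
kept = drop_small_clusters(assignments)
```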
To evaluate the proposed NMSC, we also tested the same data with two other
well-known graph-based clustering methods, Average Association (AA for short) [10]
and Normalized Cut (NC for short) [11]. The graph $G = G(V, E)$ is the input
of these graph-based methods, in which the vertex set $V = \{d_j\}$ is the set of