Non-negative Mutative-Sparseness Coding towards Hierarchical Representation - Intelligent Computing for Sustainable Energy and Environment


3.2 Document Features

The famous weighted term-frequency (tf-idf) [8] vector is used to represent the document. We collected the vocabulary set $W = \{w_1, w_2, \cdots, w_m\}$ after removing stop-words. The tf-idf vector of document $d_i$ is defined as $X_i = [x_{1i}, x_{2i}, \cdots, x_{mi}]^T$, with

\[ x_{ji} = t_{ji} \cdot \log\left(\frac{n}{idf_j}\right) \tag{6} \]

in which $t_{ji}$, $idf_j$, and $n$ denote the term frequency of the word $w_j$ in document $d_i$, the number of documents containing $w_j$, and the total number of documents, respectively. In particular, words in the title should generally be more important than words in the body text. According to statistics, a word in the title has 5 times the significance of the same word in the body [9]. Therefore, when computing word frequencies, words that appear in both the title and the text are given special treatment: for them, $x_{ji}$ is formulated as $x_{ji} = 6\, t_{ji} \cdot \log(n / idf_j)$. Moreover, each vector $X_i$ is normalized to unit length. Thus, the $n \times m$ matrix $X$ denotes the data matrix with non-negative elements.
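The feature construction above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function name `tfidf_vectors`, the `(title_tokens, body_tokens)` input layout, and the `title_factor` parameter are assumptions introduced here. It also assumes that $t_{ji}$ counts occurrences in the body text and that the factor 6 is applied when the word also occurs in the title, which is one possible reading of the description.

```python
import math

def tfidf_vectors(docs, vocab, title_factor=6.0):
    """Build unit-length tf-idf rows following Eq. (6).

    docs  : list of (title_tokens, body_tokens) pairs (hypothetical layout)
    vocab : list of words w_1..w_m, stop-words already removed
    """
    n = len(docs)
    # Number of documents whose body contains each word
    # (the quantity the text calls idf_j).
    df = {w: sum(1 for _, body in docs if w in body) for w in vocab}
    vectors = []
    for title, body in docs:
        row = []
        for w in vocab:
            t = body.count(w)                  # term frequency t_ji
            if t == 0:
                row.append(0.0)
                continue
            x = t * math.log(n / df[w])        # Eq. (6); never negative
            if w in title:                     # assumed reading of the factor 6
                x *= title_factor
            row.append(x)
        # Normalize each document vector to unit Euclidean length.
        norm = math.sqrt(sum(v * v for v in row))
        vectors.append([v / norm for v in row] if norm > 0 else row)
    return vectors

docs = [
    (["oil"], ["oil", "price", "oil"]),
    ([], ["price", "rise"]),
    ([], ["grain", "rise"]),
]
vocab = ["oil", "price", "rise", "grain"]
X = tfidf_vectors(docs, vocab)   # n rows, m columns, all entries >= 0
```

Because $n / idf_j \geq 1$ for every word that occurs at least once, each logarithm is non-negative, which is what keeps the rows of $X$ non-negative.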

3.3 Evaluation

The test documents are randomly selected from the data matrix X and are drawn from several clusters. In each test round, the feature vectors of the documents from the k selected clusters are processed by the clustering methods. We evaluate the result against the ground-truth cluster labels provided with the original Reuters data. The accuracy of a clustering result is defined as the proportion of documents that are assigned to the same cluster as in the ground truth:

\[ \text{accuracy} = \frac{\sum_{i=1}^{n} \delta(c_i, l_i)}{n} \tag{7} \]

where $c_i$ is the index of the cluster that document $d_i$ is assigned to, $l_i$ is the index of its labeled cluster in the ground truth, and $\delta(c_i, l_i)$ equals 1 if the two indices match and 0 otherwise. We apply the max matching strategy or the Kuhn-Munkres algorithm as the assignment approach to find the matching between the result clusters and the ground-truth clusters.
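As a rough illustration of Eq. (7), the sketch below computes the accuracy under the best one-to-one mapping between result clusters and ground-truth labels. It does not implement the Kuhn-Munkres algorithm itself: it searches all mappings by brute force, which reaches the same optimum for a small number of clusters k but scales factorially. The function name `clustering_accuracy` is hypothetical.

```python
from itertools import permutations

def clustering_accuracy(pred, truth):
    """Accuracy of Eq. (7) under the best one-to-one cluster-to-label mapping.

    Brute-force stand-in for Kuhn-Munkres; only practical for small k.
    """
    if len(pred) != len(truth) or not pred:
        raise ValueError("pred and truth must be non-empty and equally long")
    clusters = sorted(set(pred))
    labels = sorted(set(truth))
    # Pad with None so that surplus result clusters can stay unmatched.
    size = max(len(clusters), len(labels))
    padded = labels + [None] * (size - len(labels))
    # overlap[(c, l)] = number of documents in cluster c with true label l
    overlap = {}
    for c, l in zip(pred, truth):
        overlap[(c, l)] = overlap.get((c, l), 0) + 1
    best = 0
    for perm in permutations(padded, len(clusters)):
        matched = sum(overlap.get((c, l), 0) for c, l in zip(clusters, perm))
        best = max(best, matched)
    return best / len(truth)

# Cluster indices are arbitrary, so a relabeled perfect result still scores 1.0.
acc = clustering_accuracy([1, 1, 0, 0], ["a", "a", "b", "b"])
```

For larger k, a polynomial-time assignment solver would be the natural replacement for the permutation loop.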

3.4 Implementation and Comparisons

The proposed NMSC is well suited to the task of document clustering because the document features are non-negative. In this test, we set up NMSC with 3 layers, and the method automatically removed any cluster containing fewer than 5 documents.
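The pruning rule for small clusters can be illustrated as follows. This sketch covers only the size filter, not NMSC itself, and the text does not say what happens to the documents of a removed cluster. Marking them with a sentinel value (`unassigned=-1`) is an assumption made here, as is the function name `drop_small_clusters`.

```python
from collections import Counter

def drop_small_clusters(assignments, min_size=5, unassigned=-1):
    """Remove clusters with fewer than min_size documents.

    Documents of a removed cluster are marked `unassigned`
    (an assumption; the source does not specify their fate).
    """
    sizes = Counter(assignments)
    return [c if sizes[c] >= min_size else unassigned for c in assignments]

# Cluster 1 has only 3 documents, so it is removed.
pruned = drop_small_clusters([0] * 6 + [1] * 3 + [2] * 5)
```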

To evaluate the proposed NMSC, we also tested the same data with two other well-known graph-based clustering methods, Average Association (AA for short) [10] and Normalized Cut (NC for short) [11]. The input to these graph-based methods is the graph $G = G(V, E)$, in which the vertex set $V = \{d_j\}$ is the set of
