Graph Model for Pattern Recognition in Text - Mining and Analyzing Social Networks

Let $R(D_i)$ be the signature vector of the $i$-th document.

Let $M = [R(D_1), R(D_2), \ldots, R(D_{n-1}), R(D_n)]^T$; then $M$ is an $n \times (m + m^2)$ matrix ($n$ is the total number of documents, $m$ is the cardinality of the keyword set $K$). Each row of the matrix represents a document.
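As a rough illustration of the shape of $M$, the sketch below stacks per-document signature vectors into an $n \times (m + m^2)$ list of rows. The helper name `build_signature_matrix` and the sample values are hypothetical; how each signature vector is actually computed is not shown here and is assumed to be given.

```python
# Minimal sketch: stack signature vectors R(D_1), ..., R(D_n) into M.
# build_signature_matrix is a hypothetical helper, not the authors' code.

def build_signature_matrix(signatures, m):
    """Return M as a list of n rows, each of length m + m**2."""
    expected = m + m * m
    rows = []
    for i, vec in enumerate(signatures):
        if len(vec) != expected:
            raise ValueError(
                f"signature {i} has length {len(vec)}, expected {expected}"
            )
        rows.append([float(x) for x in vec])
    return rows


# Made-up example: m = 2 keywords, so each signature has 2 + 4 = 6 entries.
m = 2
signatures = [
    [1, 0, 2, 0, 0, 1],
    [0, 3, 0, 1, 1, 0],
    [2, 2, 1, 0, 0, 1],
]
M = build_signature_matrix(signatures, m)
```

Each row of `M` corresponds to one document, matching the convention stated above.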

2.3.2 Normalization of the Matrix

We normalize the matrix $M$ with respect to its columns so that every dimension is comparable. We denote the normalized matrix as $\widetilde{M} = [\widetilde{R}(D_1), \widetilde{R}(D_2), \ldots, \widetilde{R}(D_{n-1}), \widetilde{R}(D_n)]^T$. The details of the normalization are presented in the next section.
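The exact normalization rule is not given in this section. Purely as an illustration of what normalizing "with respect to the columns" can look like, the sketch below divides every column by its maximum absolute value. That scaling rule, the function name `normalize_columns`, and the handling of all-zero columns are assumptions, not the authors' method.

```python
# Illustrative column normalization; the max-abs rule is an assumption.

def normalize_columns(M):
    """Scale each column of M (a list of equal-length rows) into [-1, 1]."""
    if not M:
        return []
    n_cols = len(M[0])
    # Largest absolute value in each column.
    col_max = [max(abs(row[j]) for row in M) for j in range(n_cols)]
    return [
        # An all-zero column stays zero instead of dividing by zero.
        [row[j] / col_max[j] if col_max[j] else 0.0 for j in range(n_cols)]
        for row in M
    ]


M = [
    [1.0, 0.0, 4.0],
    [2.0, 0.0, 2.0],
    [4.0, 0.0, 1.0],
]
M_norm = normalize_columns(M)
```

After this step every column has the same scale, so no single dimension dominates a later comparison between rows.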

2.3.3 Similarity

The similarity $S_{ab}$ between any two documents $D_a$, $D_b$ is determined by the cosine similarity as follows:

$$S_{ab} = \frac{|\widetilde{R}(D_a) \cdot \widetilde{R}(D_b)|}{|\widetilde{R}(D_a)| \cdot |\widetilde{R}(D_b)|}$$

where $\widetilde{R}(D_a)$, $\widetilde{R}(D_b)$ are the normalized signature vectors of the documents $D_a$, $D_b$.
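The similarity formula above can be sketched in a few lines of Python. The function names `cosine_similarity` and `similarity_matrix` are hypothetical, the input rows are assumed to be already-normalized signature vectors, and returning 0.0 for a zero vector is an assumption made here to avoid division by zero.

```python
import math

# Sketch of S_ab = |u . v| / (|u| * |v|) for two signature vectors u, v.

def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is similar to nothing
    return abs(dot) / (norm_u * norm_v)


def similarity_matrix(rows):
    """Pairwise similarities S[a][b] for all documents (rows)."""
    n = len(rows)
    return [
        [cosine_similarity(rows[a], rows[b]) for b in range(n)]
        for a in range(n)
    ]


# Made-up normalized signature vectors for three documents.
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
S = similarity_matrix(rows)
```

The resulting matrix is symmetric, so in practice only the entries with $a < b$ need to be computed before passing the similarities to a clustering step.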

2.4 Details of Step 3

A variety of clustering algorithms have been developed and implemented in popular statistical software packages. General reviews of cluster analysis can be found in many references, for instance [3, 4, 11]. In general, none of these algorithms can rigorously guarantee a globally optimal clustering for non-trivial objective functions [23].

After calculating the pairwise similarities of all documents, we cluster the documents into groups by applying the Quasi-Clique Merge (QCM) method. One of the most significant differences between the QCM method and other clustering algorithms is that the QCM method constructs a much smaller hierarchical tree. This tree structure leads to better identification of meaningful clusters, since fewer subdivisions of the data set are caused by irrelevant or improperly interpreted information. Additionally, the QCM method produces a multi-membership clustering [14], which preserves some of the ambiguity inherent in the data set rather than erroneously suppressing it, as many other clustering algorithms do.
