Input: The collection of sets P(K_μ) for all keywords K_μ (provided in Algorithm 3.3.1).
Output: A (1 × (m + m²))-vector R(D_i).
Calculation: Let F(D_i) = [f_μ] = [f_1, ···, f_m] be a (1 × m)-matrix where f_μ = |P(K_μ)|.
Let W(D_i) = [α_{μ,ν}] be an (m × m)-matrix with

α_{μ,ν} = Σ 1/(p_μ − p_ν + 1),

where the summation is over all pairs p_μ ∈ P(K_μ) and p_ν ∈ P(K_ν) with p_μ > p_ν. Note: this is not a symmetric matrix; parallel arcs in opposite directions in the graph are considered differently.
Let R(D_i) = [F(D_i), W(D_i)], where we rewrite the matrix W(D_i) as a (1 × m²) row vector W(D_i).
Complexity: It costs 1 × |P(K_μ)| for the calculation of f_μ, and it costs |P(K_μ)| · |P(K_ν)| for the calculation of α_{μ,ν} and α_{ν,μ}, for every μ, ν ∈ {1, ···, m}. So, the complexity is O(m²φ²), where φ is the average of |P(K_μ)| over μ ∈ {1, ···, m} (the average number of appearances of a keyword in the document D_i).
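The construction of the signature vector R(D_i) above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the function name `signature_vector` and the 0-based keyword indexing are assumptions made here.

```python
from itertools import product

def signature_vector(positions):
    """Sketch of the signature vector R(D_i) = [F(D_i), W(D_i)].

    `positions` maps each keyword index mu (0-based here) to the set
    P(K_mu) of positions at which that keyword occurs in the document.
    Returns a flat list of length m + m^2.
    """
    m = len(positions)
    # F(D_i): f_mu = |P(K_mu)|, the occurrence count of each keyword.
    F = [len(positions[mu]) for mu in range(m)]
    # W(D_i): alpha_{mu,nu} sums 1/(p_mu - p_nu + 1) over all pairs
    # with p_mu > p_nu; the matrix is deliberately not symmetric.
    W = [[0.0] * m for _ in range(m)]
    for mu in range(m):
        for nu in range(m):
            for p_mu, p_nu in product(positions[mu], positions[nu]):
                if p_mu > p_nu:
                    W[mu][nu] += 1.0 / (p_mu - p_nu + 1)
    # Rewrite the (m x m)-matrix W row by row as a 1 x m^2 vector
    # and concatenate it after F.
    return F + [w for row in W for w in row]
```

The three nested loops make the O(m²φ²) cost visible: each (μ, ν) pair inspects |P(K_μ)| · |P(K_ν)| position pairs.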
3.5 Similarity Calculation
Let D = {D_1, ···, D_n} be a set of documents.
3.5.1 Data Normalization
Input: Let R(D_i) be the (1 × (m + m²))-vector calculated in Section 3.4. Additionally, let M(D) = [β_{i,j}] be the (n × (m + m²))-matrix whose i-th row is the signature vector R(D_i), so that β_{i,j} is the j-th component of the signature vector R(D_i).
Calculation and output: For each j ∈ {1, ···, m + m²}, let A_j be the average of all cells in the j-th column of the matrix M(D). Then

M(D) := [β_{i,j}] = [β_{i,j}/A_j].
Complexity: O(nm²).
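The normalization step above, dividing each entry by its column average so that every column averages to 1, can be sketched as follows. The function name `normalize_columns` is an assumption, and so is the handling of all-zero columns (A_j = 0), which the text does not specify.

```python
def normalize_columns(M):
    """Sketch of Section 3.5.1: replace beta_{i,j} by beta_{i,j}/A_j,
    where A_j is the average of the j-th column of M(D).

    `M` is the (n x (m + m^2))-matrix whose i-th row is R(D_i),
    given as a list of equal-length lists.
    """
    n = len(M)
    cols = len(M[0])
    # A_j: average of the j-th column over all n documents.
    A = [sum(row[j] for row in M) / n for j in range(cols)]
    # Assumption: an all-zero column (A_j = 0) is left as zero rather
    # than dividing by zero; the source does not cover this case.
    return [[row[j] / A[j] if A[j] != 0 else 0.0 for j in range(cols)]
            for row in M]
```

Each of the n · (m + m²) cells is read twice (once for the averages, once for the division), matching the stated O(nm²) cost.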
3.5.2 Similarity
The similarity between two documents D_a and D_b is then calculated as