Input: The collection of sets P(K_μ) for all keywords K_μ (provided in Algorithm 3.3.1).
Output: A (1 × (m + m²))-vector R(D_i).
Calculation:
Let F(D_i) = [f_μ] = [f_1, ···, f_m] be a (1 × m)-matrix where f_μ = |P(K_μ)|.
Let W(D_i) = [α_{μ,ν}] be an (m × m)-matrix with

    α_{μ,ν} = Σ 1 / (p_μ − p_ν + 1),

where the summation is over all pairs p_μ ∈ P(K_μ) and p_ν ∈ P(K_ν) with p_μ > p_ν.
Note: this is not a symmetric matrix; parallel arcs in opposite directions in the graph are treated differently.
Let R(D_i) = [F(D_i), W(D_i)]. Here we rewrite the matrix W(D_i) as a (1 × m²) row vector W(D_i).
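As a sketch, the construction of R(D_i) above might be implemented as follows. The function name and the representation of each P(K_μ) as a Python set of positions are my own choices, not from the text:

```python
from itertools import product

def signature_vector(positions):
    """Build the signature vector R(D_i) = [F(D_i), W(D_i)].

    `positions` is a list of sets: positions[mu] = P(K_mu), the set of
    positions at which keyword K_mu occurs in the document D_i.
    Returns a flat list of length m + m^2."""
    # F(D_i): f_mu = |P(K_mu)|, the number of occurrences of each keyword.
    F = [len(P) for P in positions]
    # W(D_i): alpha_{mu,nu} = sum of 1/(p_mu - p_nu + 1) over all pairs
    # p_mu in P(K_mu), p_nu in P(K_nu) with p_mu > p_nu; note the matrix
    # is not symmetric, so both (mu, nu) and (nu, mu) are computed.
    W = []
    for P_mu in positions:
        for P_nu in positions:
            W.append(sum(1.0 / (p_mu - p_nu + 1)
                         for p_mu, p_nu in product(P_mu, P_nu)
                         if p_mu > p_nu))
    return F + W  # W(D_i) flattened to a (1 x m^2) row vector
```

For example, with m = 2 keywords at positions P(K_1) = {1, 3} and P(K_2) = {2}, the frequency part is [2, 1] and the only contributing pairs are (3, 1), (3, 2), and (2, 1).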
Complexity: It costs |P(K_μ)| for the calculation of f_μ, and it costs |P(K_μ)| · |P(K_ν)| for the calculation of α_{μ,ν} and α_{ν,μ}, for every μ, ν. So, the complexity is O(m²φ²), where φ is the average of |P(K_μ)| over μ ∈ {1, ···, m} (the average number of appearances of a keyword in the document D_i).
3.5 Similarity Calculation
Let D = {D_1, ···, D_n} be a set of documents.
3.5.1 Data Normalization
Input: Let R(D_i) be the (1 × (m + m²))-vector calculated in Section 3.4. Additionally, let M(D) = [β_{i,j}] be the (n × (m + m²))-matrix whose i-th row is the signature vector R(D_i), so that β_{i,j} is the j-th component of the signature vector R(D_i).
Calculation and output: For each j ∈ {1, ···, m + m²}, let A_j be the average of all cells in the j-th column of the matrix M(D). Then

    M(D) := [β_{i,j}] = [β_{i,j} / A_j].
Complexity: O ( nm 2 )
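A minimal sketch of this column-wise normalization, assuming M(D) is stored as a list of row vectors. The guard against an all-zero column (A_j = 0) is my own addition; the formula in the text implicitly assumes A_j ≠ 0:

```python
def normalize_columns(M):
    """Normalize the signature matrix M(D) = [beta_{i,j}] column-wise:
    each entry beta_{i,j} is divided by A_j, the average of column j.

    `M` is a list of n rows, each of length m + m^2."""
    n = len(M)
    # A_j: average of all cells in the j-th column.
    averages = [sum(col) / n for col in zip(*M)]
    # Divide each entry by its column average; all-zero columns
    # (A_j = 0) are left as zeros to avoid division by zero.
    return [[b / a if a != 0 else 0.0 for b, a in zip(row, averages)]
            for row in M]
```

Each pass touches every cell once, matching the stated O(nm²) cost.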
3.5.2 Similarity
The similarity between two documents D a and D b is then calculated as
 