Input: The collection of sets P(K_μ) for all keywords K_μ (provided in Algorithm 3.3.1).
Output: A (1 × (m + m²))-vector R(D_i).
Calculation: Let F(D_i) = [f_μ] = [f_1, ···, f_m] be a (1 × m)-matrix where f_μ = |P(K_μ)|.
Let W(D_i) = [α_{μ,ν}] be an (m × m)-matrix with

α_{μ,ν} = Σ 1/(p_μ − p_ν + 1),

where the summation is over all pairs p_μ ∈ P(K_μ) and p_ν ∈ P(K_ν) with p_μ > p_ν. Note: this is not a symmetric matrix; parallel arcs in opposite directions in the graph are considered differently.
Let R(D_i) = [F(D_i), W(D_i)], where we rewrite the matrix W(D_i) as a (1 × m²) row vector W(D_i).
Complexity: It costs 1 × |P(K_μ)| for the calculation of f_μ, and it costs |P(K_μ)| · |P(K_ν)| for the calculation of α_{μ,ν} and α_{ν,μ}, for every μ, ν ∈ {1, ···, m}. So, the complexity is O(m²φ²), where φ is the average of |P(K_μ)| over μ ∈ {1, ···, m} (the average number of appearances of a keyword in the document D_i).
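The construction of the signature vector R(D_i) above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the function name `signature_vector` and the 0-based keyword indexing are assumptions made here.

```python
from itertools import product

def signature_vector(positions):
    """Sketch of the signature vector R(D_i) = [F(D_i), W(D_i)].

    `positions` maps each keyword index mu (0-based here) to the set
    P(K_mu) of positions at which that keyword occurs in the document.
    Returns a flat list of length m + m^2.
    """
    m = len(positions)
    # F(D_i): f_mu = |P(K_mu)|, the occurrence count of each keyword.
    F = [len(positions[mu]) for mu in range(m)]
    # W(D_i): alpha_{mu,nu} sums 1/(p_mu - p_nu + 1) over all pairs
    # with p_mu > p_nu; the matrix is deliberately not symmetric.
    W = [[0.0] * m for _ in range(m)]
    for mu in range(m):
        for nu in range(m):
            for p_mu, p_nu in product(positions[mu], positions[nu]):
                if p_mu > p_nu:
                    W[mu][nu] += 1.0 / (p_mu - p_nu + 1)
    # Rewrite the (m x m)-matrix W row by row as a 1 x m^2 vector
    # and concatenate it after F.
    return F + [w for row in W for w in row]
```

The three nested loops make the O(m²φ²) cost visible: each (μ, ν) pair inspects |P(K_μ)| · |P(K_ν)| position pairs.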
3.5 Similarity Calculation
Let D = {D_1, ···, D_n} be a set of documents.
3.5.1 Data Normalization
Input: Let R(D_i) be the (1 × (m + m²))-vector calculated in Section 3.4. Additionally, let M(D) = [β_{i,j}] be the (n × (m + m²))-matrix whose i-th row is the signature vector R(D_i), so that β_{i,j} is the j-th component of the signature vector R(D_i).
Calculation and output: For each j ∈ {1, ···, m + m²}, let A_j be the average of all cells in the j-th column of the matrix M(D). Then

M(D) := [β_{i,j}] = [β_{i,j}/A_j].
Complexity: O(nm²).
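The normalization step above, dividing each entry by its column average so that every column averages to 1, can be sketched as follows. The function name `normalize_columns` is an assumption, and so is the handling of all-zero columns (A_j = 0), which the text does not specify.

```python
def normalize_columns(M):
    """Sketch of Section 3.5.1: replace beta_{i,j} by beta_{i,j}/A_j,
    where A_j is the average of the j-th column of M(D).

    `M` is the (n x (m + m^2))-matrix whose i-th row is R(D_i),
    given as a list of equal-length lists.
    """
    n = len(M)
    cols = len(M[0])
    # A_j: average of the j-th column over all n documents.
    A = [sum(row[j] for row in M) / n for j in range(cols)]
    # Assumption: an all-zero column (A_j = 0) is left as zero rather
    # than dividing by zero; the source does not cover this case.
    return [[row[j] / A[j] if A[j] != 0 else 0.0 for j in range(cols)]
            for row in M]
```

Each of the n · (m + m²) cells is read twice (once for the averages, once for the division), matching the stated O(nm²) cost.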
3.5.2 Similarity
The similarity between two documents D_a and D_b is then calculated as