Database Reference
In-Depth Information
distance between vectors. The Hamming distance can be used to derive
upper bounds on the Intersection, Jaccard, Dice, and Cosine similarity
for any uniform weighing function
W
.
Lemma 7.10.
Assuming a uniform weight function
W
(
λ
)=
c, λ
∈
Λ,
given two strings
s, r
2
θ
I
(
s, r
)
≥
θ
⇒H
(
s, r
)
≤|
s
|
+
|
r
|−
c
.
Proof.
I
(
s, r
)
≥ θ ⇒ c|s ∩ r|≥θ,
and
H
(
s, r
)=
|
s
|−|
s
∩
r
|
+
|
r
|−|
s
∩
r
|
=
|
s
|
+
|
r
|−
2
|
s
∩
r
|⇒
=
|s|
+
|r|−H
(
s, r
)
|
s
∩
r
|
.
2
Hence,
2
θ
2
θ
|
s
|
+
|
r
|−H
(
s, r
)
≥
c
⇒H
(
s, r
)
≤|
s
|
+
|
r
|−
c
.
Similarly
Lemma 7.11.
Assuming a uniform weight function
W
(
λ
)=
c
,
λ
∈
Λ,
given two strings
s, r
•
Normalized Weighted Intersection:
N
(
s, r
)
≥
θ
⇒H
(
s, r
)
≤|
s
|
+
|
r
|−
2
θ
max(
|
s
|
,
|
r
|
)
.
•
Jaccard:
1
θ
1+
θ
(
|s|
+
|r|
)
.
−
J
(
s, r
)
≥ θ ⇒H
(
s, r
)
≤
•
Dice:
D
(
s, r
)
≥
θ
⇒H
(
s, r
)
≤
(1
−
θ
)(
|
s
|
+
|
r
|
)
.
•
Cosine:
2
θ
C
(
s, r
)
≥
θ
⇒H
(
s, r
)
≤|
s
|
+
|
r
|−
|
s
||
r
|
.