One could also simply normalize by the length of the strings
max( |s|,|r| ) or even by the number of tokens in the strings max( n, m ).
A formulation that normalizes by the weight of tokens in the union
of the two strings is the Jaccard similarity.
Definition 2.7 (Jaccard Similarity). Let $s = \lambda^s_1 \cdots \lambda^s_m$, $r = \lambda^r_1 \cdots \lambda^r_n$
be two sequences of tokens. The Jaccard similarity of s and r is defined
as
$$J(s, r) = \frac{\|s \cap r\|_1}{\|s \cup r\|_1} = \frac{\|s \cap r\|_1}{\|s\|_1 + \|r\|_1 - \|s \cap r\|_1}.$$
Here, the similarity between two strings is normalized by the total
weight of the union of their token sets. The larger the weight of the
tokens that the two strings do not have in common is, the smaller the
similarity becomes. The similarity is maximized (i.e., becomes equal
to one) only if the two sequences are the same. Strictly speaking, it is
the Jaccard distance 1 - J(s, r), not the similarity itself, that is a
metric.
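The definition above can be illustrated with a minimal Python sketch. It assumes that each string is treated as a set of tokens and that the weight of a token set is the sum of per-token weights, with every token weighing 1.0 unless a weight table says otherwise. The names `l1_weight`, `jaccard_similarity`, and `weights`, and the choice to return 1.0 for two empty inputs, are illustrative assumptions and do not come from the text.

```python
from typing import Iterable, Mapping, Optional


def l1_weight(tokens: Iterable[str],
              weights: Optional[Mapping[str, float]] = None) -> float:
    """Total weight of a token collection (hypothetical helper).

    Tokens missing from `weights` default to a weight of 1.0 (assumption).
    """
    if weights is None:
        return float(sum(1 for _ in tokens))
    return float(sum(weights.get(t, 1.0) for t in tokens))


def jaccard_similarity(s: Iterable[str], r: Iterable[str],
                       weights: Optional[Mapping[str, float]] = None) -> float:
    """Weight of the shared tokens divided by the weight of the union."""
    s_set, r_set = set(s), set(r)
    inter = l1_weight(s_set & r_set, weights)
    # Union weight via inclusion-exclusion, mirroring the second form of J(s, r).
    union = l1_weight(s_set, weights) + l1_weight(r_set, weights) - inter
    if union == 0:
        return 1.0  # assumed convention: two empty token sets are identical
    return inter / union


s = "db systems group".split()
r = "db systems lab".split()

# Two shared tokens out of four distinct tokens.
unweighted = jaccard_similarity(s, r)

# Giving the non-shared tokens a larger weight lowers the similarity.
weighted = jaccard_similarity(s, r, {"group": 2.0, "lab": 2.0})
```

In this sketch, raising the weight of the tokens that the two strings do not share enlarges the denominator only, which is the behavior the paragraph above describes.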
One can also define a non-symmetric notion of Jaccard similarity,
commonly referred to as Jaccard containment.
Definition 2.8 (Jaccard Containment). Let $s = \lambda^s_1 \cdots \lambda^s_m$, $r = \lambda^r_1 \cdots \lambda^r_n$
be two sequences of tokens. The Jaccard containment of s and r is defined as
$$J_c(s, r) = \frac{\|s \cap r\|_1}{\|s\|_1}.$$
The Jaccard containment quantifies the containment of set s in set r .
Jaccard containment is maximized if and only if s ⊆ r .
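The asymmetry of the containment can be seen in a short Python sketch. As before, it assumes strings are treated as token sets with a default token weight of 1.0. The names `l1_weight`, `jaccard_containment`, and `weights`, and the convention of returning 1.0 when s carries no weight, are assumptions made for illustration.

```python
from typing import Iterable, Mapping, Optional


def l1_weight(tokens: Iterable[str],
              weights: Optional[Mapping[str, float]] = None) -> float:
    """Total weight of a token collection (hypothetical helper).

    Tokens missing from `weights` default to a weight of 1.0 (assumption).
    """
    if weights is None:
        return float(sum(1 for _ in tokens))
    return float(sum(weights.get(t, 1.0) for t in tokens))


def jaccard_containment(s: Iterable[str], r: Iterable[str],
                        weights: Optional[Mapping[str, float]] = None) -> float:
    """Weight of the tokens of s that also occur in r, divided by the weight of s."""
    s_set, r_set = set(s), set(r)
    s_weight = l1_weight(s_set, weights)
    if s_weight == 0:
        return 1.0  # assumed convention: an empty s is contained in any r
    return l1_weight(s_set & r_set, weights) / s_weight


s = ["db", "systems"]
r = ["db", "systems", "lab"]

s_in_r = jaccard_containment(s, r)  # every token of s occurs in r
r_in_s = jaccard_containment(r, s)  # "lab" does not occur in s
```

Swapping the arguments changes only the denominator, so the two directions generally give different values.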
A related set similarity function is the Dice similarity.
Definition 2.9 (Dice Similarity). Let $s = \lambda^s_1 \cdots \lambda^s_m$, $r = \lambda^r_1 \cdots \lambda^r_n$ be
two sequences of tokens. The Dice similarity of s and r is defined as
$$D(s, r) = \frac{2\,\|s \cap r\|_1}{\|s\|_1 + \|r\|_1}.$$
Dice is maximized if and only if the two sequences are the same.
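A minimal Python sketch of the Dice similarity follows, under the same assumptions as the definitions: strings are treated as token sets, and the weight of a set is the sum of its token weights, defaulting to 1.0 per token. The names `l1_weight`, `dice_similarity`, and `weights`, and the value returned for two empty inputs, are illustrative assumptions.

```python
from typing import Iterable, Mapping, Optional


def l1_weight(tokens: Iterable[str],
              weights: Optional[Mapping[str, float]] = None) -> float:
    """Total weight of a token collection (hypothetical helper).

    Tokens missing from `weights` default to a weight of 1.0 (assumption).
    """
    if weights is None:
        return float(sum(1 for _ in tokens))
    return float(sum(weights.get(t, 1.0) for t in tokens))


def dice_similarity(s: Iterable[str], r: Iterable[str],
                    weights: Optional[Mapping[str, float]] = None) -> float:
    """Twice the weight of the shared tokens over the summed weights of s and r."""
    s_set, r_set = set(s), set(r)
    total = l1_weight(s_set, weights) + l1_weight(r_set, weights)
    if total == 0:
        return 1.0  # assumed convention: two empty token sets are identical
    return 2.0 * l1_weight(s_set & r_set, weights) / total


s = "db systems group".split()
r = "db systems lab".split()

# Two shared tokens, three tokens in each string: 2 * 2 / (3 + 3).
dice = dice_similarity(s, r)
```

For the same pair of strings the Jaccard similarity is 2/4 = 0.5, and the two values are related by D = 2J / (1 + J), which follows directly from the two definitions.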
 