String Similarity Functions - Approximate String Processing


One could also simply normalize by the length of the strings, max(|s|, |r|), or even by the number of tokens in the strings, max(n, m). A formulation that normalizes by the weight of tokens in the union of the two strings is the Jaccard similarity.

Definition 2.7 (Jaccard Similarity). Let $s = \lambda^s_1 \cdots \lambda^s_m$, $r = \lambda^r_1 \cdots \lambda^r_n$ be two sequences of tokens. The Jaccard similarity of s and r is defined as

$$J(s, r) = \frac{\|s \cap r\|_1}{\|s \cup r\|_1} = \frac{\|s \cap r\|_1}{\|s\|_1 + \|r\|_1 - \|s \cap r\|_1}.$$

Here, the similarity between two strings is normalized by the total weight of the union of their token sets. The larger the weight of the tokens that the two strings do not have in common, the smaller the similarity becomes. The similarity is maximized (i.e., becomes equal to one) only if the two sequences are the same. The corresponding distance, 1 − J(s, r), is a metric.
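The definition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: strings are already tokenized, duplicate tokens are ignored (set semantics), the weight of a token set is the sum of its token weights, and every token weighs 1.0 unless a weight table is supplied. The names `total_weight` and `jaccard` are hypothetical, not taken from the text.

```python
from typing import Dict, Iterable, Optional


def total_weight(tokens: Iterable[str],
                 weights: Optional[Dict[str, float]] = None) -> float:
    """Sum of token weights over distinct tokens (assumed 1.0 when unspecified)."""
    distinct = set(tokens)
    if weights is None:
        return float(len(distinct))
    return float(sum(weights.get(t, 1.0) for t in distinct))


def jaccard(s: Iterable[str], r: Iterable[str],
            weights: Optional[Dict[str, float]] = None) -> float:
    """Weight of the shared tokens divided by the weight of the union."""
    s_set, r_set = set(s), set(r)
    union = total_weight(s_set | r_set, weights)
    if union == 0.0:
        return 1.0  # assumed convention: two empty strings are identical
    return total_weight(s_set & r_set, weights) / union


s = ["main", "st", "maine"]
r = ["main", "street", "maine"]
print(jaccard(s, r))  # 0.5: two shared tokens out of four in the union
```

Raising the weight of the tokens that differ (here "st" and "street") lowers the score, which matches the observation that heavier non-shared tokens reduce the similarity.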

One can also define a non-symmetric notion of Jaccard similarity, commonly referred to as Jaccard containment.

Definition 2.8 (Jaccard Containment). Let $s = \lambda^s_1 \cdots \lambda^s_m$, $r = \lambda^r_1 \cdots \lambda^r_n$ be two sequences of tokens. The Jaccard containment of s and r is defined as

$$J_c(s, r) = \frac{\|s \cap r\|_1}{\|s\|_1}.$$

The Jaccard containment quantifies the containment of set s in set r. Jaccard containment is maximized if and only if s ⊆ r.
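The asymmetry is easy to see in a small Python sketch. As before, this assumes tokenized input, set semantics, and a weight of 1.0 for any token without an explicit weight; `total_weight` and `jaccard_containment` are hypothetical names used only for illustration.

```python
from typing import Dict, Iterable, Optional


def total_weight(tokens: Iterable[str],
                 weights: Optional[Dict[str, float]] = None) -> float:
    """Sum of token weights over distinct tokens (assumed 1.0 when unspecified)."""
    distinct = set(tokens)
    if weights is None:
        return float(len(distinct))
    return float(sum(weights.get(t, 1.0) for t in distinct))


def jaccard_containment(s: Iterable[str], r: Iterable[str],
                        weights: Optional[Dict[str, float]] = None) -> float:
    """Weight of the shared tokens divided by the weight of s alone."""
    s_set, r_set = set(s), set(r)
    denom = total_weight(s_set, weights)
    if denom == 0.0:
        return 1.0  # assumed convention: the empty set is contained in any set
    return total_weight(s_set & r_set, weights) / denom


s = ["main", "st"]
r = ["main", "st", "maine"]
print(jaccard_containment(s, r))  # 1.0, since every token of s appears in r
print(jaccard_containment(r, s))  # about 0.667, since "maine" is missing from s
```

Swapping the arguments changes the denominator, so the two calls generally return different values.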

A related set similarity function is the Dice similarity.

Definition 2.9 (Dice Similarity). Let $s = \lambda^s_1 \cdots \lambda^s_m$, $r = \lambda^r_1 \cdots \lambda^r_n$ be two sequences of tokens. The Dice similarity of s and r is defined as

$$D(s, r) = \frac{2\,\|s \cap r\|_1}{\|s\|_1 + \|r\|_1}.$$

Dice is maximized if and only if the two sequences are the same.
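A minimal Python sketch of the Dice similarity follows, under the same assumptions as the definition's set-based reading: tokenized input, duplicates ignored, and a weight of 1.0 for any token without an explicit weight. The names `total_weight` and `dice` are hypothetical.

```python
from typing import Dict, Iterable, Optional


def total_weight(tokens: Iterable[str],
                 weights: Optional[Dict[str, float]] = None) -> float:
    """Sum of token weights over distinct tokens (assumed 1.0 when unspecified)."""
    distinct = set(tokens)
    if weights is None:
        return float(len(distinct))
    return float(sum(weights.get(t, 1.0) for t in distinct))


def dice(s: Iterable[str], r: Iterable[str],
         weights: Optional[Dict[str, float]] = None) -> float:
    """Twice the weight of the shared tokens over the sum of both weights."""
    s_set, r_set = set(s), set(r)
    denom = total_weight(s_set, weights) + total_weight(r_set, weights)
    if denom == 0.0:
        return 1.0  # assumed convention: two empty strings are identical
    return 2.0 * total_weight(s_set & r_set, weights) / denom


s = ["12", "main", "st"]
r = ["main", "st", "augusta", "maine", "usa"]
print(dice(s, r))  # 0.5: 2 * 2 shared tokens over 3 + 5 tokens
```

Unlike Jaccard, the denominator counts the shared tokens twice, so for the same pair Dice is never smaller than Jaccard (here 0.5 versus 2/6).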
