String Similarity Functions - Approximate String Processing


Notice that this definition implicitly treats frequency-set intersection as intersection on bags of tokens, by virtue of the min operation. In other words, two frequency-sets do not have to agree both on the token and the exact frequency of that token, for the token to be considered in the intersection.
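
The role of the min operation can be illustrated with a minimal Python sketch. It assumes, since the definition itself is not restated here, that each shared token contributes its weight multiplied by the smaller of its two frequencies. The names `weighted_bag_intersection` and `unit_weight` are hypothetical and used only for illustration.

```python
from collections import Counter

def weighted_bag_intersection(s_tokens, r_tokens, weight):
    """Bag-style weighted intersection (assumed form): each shared token
    counts min(f_s, f_r) times, scaled by its weight W(token)."""
    fs, fr = Counter(s_tokens), Counter(r_tokens)
    shared = fs.keys() & fr.keys()  # tokens present in both frequency-sets
    return sum(min(fs[t], fr[t]) * weight(t) for t in shared)

def unit_weight(token):
    # Hypothetical weight function W: every token weighs 1.0.
    return 1.0

s = ["a", "a", "b", "c"]
r = ["a", "b", "b", "d"]
# "a" appears 2 and 1 times, "b" appears 1 and 2 times; each contributes min = 1.
print(weighted_bag_intersection(s, r, unit_weight))  # 2.0
```

Both "a" and "b" count toward the intersection even though their frequencies differ between the two strings, which is the behavior the paragraph above describes.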

Finally, the definition can be modified to disregard token positions and multiplicity.

Definition 2.5 (Weighted Intersection on Sets). Let $s = \{\lambda_1^s, \ldots, \lambda_m^s\}$, $r = \{\lambda_1^r, \ldots, \lambda_n^r\}$, $\lambda_i^s, \lambda_i^r \in \Lambda$, be two sets of tokens. The weighted intersection of $s$ and $r$ is defined as

$$I(s, r) = \sum_{\lambda \in s \cap r} W(\lambda).$$
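
As a rough illustration of Definition 2.5, the following Python sketch sums the weights of the tokens shared by two sets. The weight table `W` and the sample tokens are made-up values chosen for the example, and `weighted_set_intersection` is a hypothetical helper name.

```python
def weighted_set_intersection(s_tokens, r_tokens, weight):
    """I(s, r): sum of W(token) over tokens occurring in both sets.
    Positions and multiplicities are ignored by converting to sets."""
    return sum(weight(t) for t in set(s_tokens) & set(r_tokens))

# Hypothetical token weights (illustrative values only).
W = {"main": 0.5, "st": 0.25, "maine": 2.0}

s = ["main", "st", "main"]   # the repeated "main" is counted once
r = ["main", "st", "maine"]
print(weighted_set_intersection(s, r, W.get))  # 0.75
```

The repeated token in `s` has no effect on the result, in line with the remark that this variant disregards multiplicity.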

Whether strings are represented as sequences, frequency-sets, or sets of tokens is application dependent, and any of the three representations can be used with all similarity functions introduced below. In the remainder we refer to strings as sequences, which is the most general case. Extending the definitions to frequency-sets and sets is straightforward.

Notice that the definitions above do not take into account the weight or number of tokens that the two strings do not have in common (i.e., in the complement of their intersection). In certain applications, it is required for strings to have similar lengths (either similar number of tokens or similar total token weight). One could use various forms of normalization to address this issue. A simple technique is to divide the weighted intersection by the maximum sequence weight.

Definition 2.6 (Normalized Weighted Intersection). Let $s = \lambda_1^s \cdots \lambda_m^s$, $r = \lambda_1^r \cdots \lambda_n^r$ be two sequences of tokens. The normalized weighted intersection of $s$ and $r$ is defined as

$$N(s, r) = \frac{\|s \cap r\|_1}{\max(\|s\|_1, \|r\|_1)},$$

where $\|s\|_1 = \sum_{i=1}^{\|s\|_0} W(\lambda_i^s)$ (i.e., the $L_1$-norm of token sequence $s$).
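
The normalization in Definition 2.6 can be sketched in Python as follows. For simplicity the sketch assumes the set representation (distinct tokens, positions ignored) rather than full sequences. The weight table `W`, the helper names, and the choice to return 0.0 when both inputs are empty are all assumptions made for this example.

```python
def l1_norm(tokens, weight):
    """L1-norm of a token collection: the sum of its token weights."""
    return sum(weight(t) for t in tokens)

def normalized_weighted_intersection(s_tokens, r_tokens, weight):
    """N(s, r): weight of the shared tokens divided by the larger of the
    two total weights. Inputs are treated as sets (an assumption)."""
    s_set, r_set = set(s_tokens), set(r_tokens)
    denom = max(l1_norm(s_set, weight), l1_norm(r_set, weight))
    if denom == 0:
        return 0.0  # assumption: define N as 0 when neither string has weight
    return l1_norm(s_set & r_set, weight) / denom

# Hypothetical token weights (illustrative values only).
W = {"main": 0.5, "st": 0.25, "maine": 0.25}

s = ["main", "st", "maine"]  # total weight 1.0
r = ["main", "st"]           # total weight 0.75
print(normalized_weighted_intersection(s, r, W.get))  # 0.75
```

Here the shared tokens weigh 0.75 in total and the heavier string weighs 1.0, so the extra token in `s` lowers the score below 1 even though every token of `r` is matched.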
