String Similarity Functions - Approximate String Processing


as a sequence of tokens is to use a set of token/position pairs $s = \{(\lambda^s_1, 1), \ldots, (\lambda^s_m, m)\}$. A weaker string representation is to sacrifice the positional information and only preserve the number of times each token appears in the string.

Definition 2.2 (Token Frequency). The token frequency $f_s(\lambda)$ is the number of occurrences of token $\lambda$ in string $s$. When clear from context, we simply write $f(\lambda)$ to refer to the token frequency of $\lambda$ in a certain string.

Using token frequencies, a string can be represented as the set $s = \{(\lambda_1, f(\lambda_1)), \ldots, (\lambda_n, f(\lambda_n))\}$ (notice that $n \le m$). Here the order of tokens is lost. We refer to this representation as a frequency-set of tokens. An even weaker representation is to discard the token frequency and consider strings as simple sets of tokens, i.e., $s = \{\lambda_1, \ldots, \lambda_n\}$. We differentiate between these three string representations as sequences, frequency-sets, and sets.
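The three representations can be illustrated with a minimal Python sketch. It assumes whitespace tokenization, which the passage above does not prescribe, and the helper names `as_sequence`, `as_frequency_set`, and `as_set` are hypothetical labels chosen for this example.

```python
from collections import Counter


def as_sequence(tokens):
    """Set of (token, position) pairs; positions start at 1."""
    return {(tok, pos) for pos, tok in enumerate(tokens, start=1)}


def as_frequency_set(tokens):
    """Set of (token, frequency) pairs; the token order is lost."""
    return set(Counter(tokens).items())


def as_set(tokens):
    """Plain set of distinct tokens; order and frequency are both lost."""
    return set(tokens)


# Assumption: tokens are whitespace-separated words.
tokens = "to be or not to be".split()

sequence = as_sequence(tokens)            # m = 6 token/position pairs
frequency_set = as_frequency_set(tokens)  # n = 4 token/frequency pairs
token_set = as_set(tokens)                # n = 4 distinct tokens
```

Here the sequence has m = 6 elements while the frequency-set and the set each have n = 4, consistent with n ≤ m.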

It should be stressed here that all three representations are explicitly defined to be sets of elements (as opposed to multi-sets or bags). Hence, in what follows, all intersection and union predicates operate on sets. Notice also that the most general representation of the three is sequences: when indexing sequences, one can easily disregard the positional information of the tokens and treat the strings either as frequency-sets or sets. The particular interpretation of a string as a sequence, a frequency-set, or a set has a significant influence on the semantics of the similarity between strings. Sequences are the strictest interpretation (strings not only have to agree on a large number of tokens, but those tokens also have to have similar positions within the string), and sets are the loosest (the similarity of strings depends only on the number of tokens shared, rather than on the position or the multiplicity of those tokens).
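To make the difference in strictness concrete, the following sketch counts the elements two strings share under each interpretation. It is an illustration only: it assumes whitespace tokenization, and plain set intersection on token/position pairs demands identical positions, which is a stricter stand-in for the "similar positions" mentioned above. The function name `represent` is hypothetical.

```python
from collections import Counter


def represent(tokens, mode):
    """Return the set representation of a token list for the given mode."""
    if mode == "sequence":
        return {(tok, pos) for pos, tok in enumerate(tokens, start=1)}
    if mode == "frequency-set":
        return set(Counter(tokens).items())
    if mode == "set":
        return set(tokens)
    raise ValueError(f"unknown mode: {mode}")


# Assumption: tokens are whitespace-separated words.
s = "the cat saw the dog".split()
r = "the dog saw a cat".split()

# Number of shared elements under each interpretation.
shared = {
    mode: len(represent(s, mode) & represent(r, mode))
    for mode in ("sequence", "frequency-set", "set")
}
print(shared)  # {'sequence': 2, 'frequency-set': 3, 'set': 4}
```

The same pair of strings shares 2 elements as sequences, 3 as frequency-sets, and 4 as sets, matching the ordering from strictest to loosest.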

Let $W : \Lambda \rightarrow \mathbb{R}^+$ be a function that assigns a positive real value as a weight of each token in $\Lambda$. The simplest function for evaluating the similarity between two strings is the weighted intersection of the token sequences.
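A rough sketch of a weighted intersection follows. The passage names the function but does not spell out its formula here, so this example assumes it is the sum of the token weights over the elements common to both strings. The weight values, the whitespace tokenization, and the names `weighted_intersection` and `token_of` are all assumptions made for illustration.

```python
def weighted_intersection(s, r, weight, token_of=lambda element: element):
    """Sum the weights of the tokens behind the elements common to s and r.

    s and r are sets; token_of extracts the token from an element, e.g. the
    first component of a (token, position) pair.
    """
    return sum(weight[token_of(e)] for e in s & r)


# Hypothetical positive weights; any function W from tokens to R+ would do.
W = {"the": 0.25, "a": 0.25, "saw": 0.5, "cat": 2.0, "dog": 2.0}

# Assumption: tokens are whitespace-separated words.
s_tokens = "the cat saw the dog".split()
r_tokens = "the dog saw a cat".split()

# Strings interpreted as plain sets of tokens.
set_score = weighted_intersection(set(s_tokens), set(r_tokens), W)

# Strings interpreted as sequences, i.e. sets of (token, position) pairs.
s_seq = {(tok, pos) for pos, tok in enumerate(s_tokens, start=1)}
r_seq = {(tok, pos) for pos, tok in enumerate(r_tokens, start=1)}
seq_score = weighted_intersection(s_seq, r_seq, W, token_of=lambda e: e[0])

print(set_score, seq_score)  # 4.75 0.75
```

Under the set interpretation all four shared tokens contribute, while under the sequence interpretation only the tokens that also agree on position ("the" at 1 and "saw" at 3) do.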
