String Similarity Functions - Approximate String Processing

Database Reference

In-Depth Information

example, the two strings 'One Laptop per Child' and 'One Child per

Laptop' have edit distance twelve. Clearly, a better way to compare

these two strings using edit distance is to first decompose them into

individual words and compare each word separately first, then deter-

mine the similarity between the complete strings as a weighted function

of matching words, their respective edit distance, and possibly their

relative positions in the two strings. Of course, the cost of compar-

ing the strings now becomes quadratic in the number of words in the

strings.

2.2 Set Based Similarity

Set based similarity decomposes strings into sets using a variety of

tokenization methods and evaluates similarity of strings with respect

to the similarity of their sets. A variety of set similarity functions can

be used for that purpose, all of which have a similar flavor: Each set

element (or token) is assigned a weight and the similarity between the

sets is computed as a weighted function of the tokens in the intersection

and/or the complement of the intersection of the sets. The application

characteristics heavily influence, first, the tokens to extract, second, the

token weights, and third, the type of similarity function used.

Strings are modeled as sequences of characters, nevertheless a string

can also be represented as a set or a multi-set of tokens, based on

the tokenization function used. Let Λ be a token universe. Let s =

λ 1 λ 2 ···

λ s m ,λ i ∈

Λ be a string consisting of a sequence of m tokens. This

definition is a generalization of the definition of strings as a sequence of

characters (recall that s = σ 1

σ ). If the chosen tokenization returns

each character in the string as an individual token, then the two def-

initions are the same, and m = =

···

|

(in other words, the number of tokens in s is not always equal to the

number of characters in s ). In the rest, for clarity we will always refer

to the number of characters in s with

|

s

|

. In general, notice that m

=

|

s

|

s

|

and the number of tokens in

s with

0 (i.e., the L 0 -norm of set s ), irrespective of the context of

the similarity function and the specific tokenization used.

Implicitly, each token in the sequence above is assigned a posi-

tion in the string. A more explicit way of representing the string

s

Approximate String Processing

Search WWH ::

Custom Search

Home