2 String Similarity Functions
There are two types of similarity functions for strings: set based and
edit based. Set based similarity treats strings as sets of tokens and
measures the similarity of two strings by the similarity of the
corresponding token sets. Edit based similarity treats strings as sequences
of characters (or tokens), assigning an absolute position to each character
(or token), and measures string similarity by the minimum number of edit
operations needed to convert one sequence of characters (or tokens) into
the other.
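To make the set based view concrete, the following Python sketch treats each string as a set of whitespace-separated word tokens and compares the two sets. The choice of word tokens, the lowercasing step, and the Jaccard coefficient as the set measure are all assumptions made for illustration; the text above does not fix a particular tokenization or set similarity, and the function names are hypothetical.

```python
def word_tokens(s: str) -> set:
    """Return the set of distinct lowercase, whitespace-separated tokens of s."""
    return set(s.lower().split())


def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient of the two token sets: |A & B| / |A | B|."""
    ta, tb = word_tokens(a), word_tokens(b)
    if not ta and not tb:
        # Convention chosen for this sketch: two empty token sets are identical.
        return 1.0
    return len(ta & tb) / len(ta | tb)
```

Because token order is discarded, `jaccard("main street", "street main")` evaluates to 1.0, whereas an edit based measure, which depends on character positions, would treat these two strings as different.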
2.1 Edit Based Similarity
Edit based similarity is an intuitive measure for strings: the
similarity is determined by the minimum number of edit
operations needed to transform one string into another. Let $\Sigma$ be
an alphabet. Let string $s = \sigma_1 \cdots \sigma_\ell$, $\sigma_i \in \Sigma$.
Primitive edit operations consist of insertions, deletions, and
replacements of characters.
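The minimum number of insertions, deletions, and replacements between two strings can be computed with the standard dynamic programming recurrence over prefixes. The sketch below assumes unit cost for every operation and works on characters rather than tokens; the function name `edit_distance` is a hypothetical choice, not notation from the text.

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions and
    replacements needed to turn s into t (unit cost per operation)."""
    # prev[j] holds the distance between the current prefix of s and t[:j];
    # initially the prefix of s is empty, so j insertions are needed.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into "" takes i deletions
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cs from s
                curr[j - 1] + 1,           # insert ct into s
                prev[j - 1] + (cs != ct),  # replace cs by ct (free if equal)
            ))
        prev = curr
    return prev[-1]
```

For example, `edit_distance("kitten", "sitting")` returns 3 (two replacements and one insertion). Keeping only two rows of the table bounds the memory use by the length of `t`, at the cost of not recording which operations were applied.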
An insertion $I(s, i, \sigma)$ of character $\sigma \in \Sigma$ at position $i$ of string $s$
results in a new string $s'$ of length $\ell + 1$,
$s' = \sigma_1 \cdots \sigma_{i-1} \, \sigma \, \sigma_i \cdots \sigma_\ell$. A
deletion $D(s, i)$ removes character $\sigma_i$, resulting in a new string $s'$ of