Database Reference
In-Depth Information
example, the two strings 'One Laptop per Child' and 'One Child per
Laptop' have edit distance twelve. Clearly, a better way to compare
these two strings using edit distance is to first decompose them into
individual words and compare each word separately first, then deter-
mine the similarity between the complete strings as a weighted function
of matching words, their respective edit distance, and possibly their
relative positions in the two strings. Of course, the cost of compar-
ing the strings now becomes quadratic in the number of words in the
strings.
2.2 Set Based Similarity
Set based similarity decomposes strings into sets using a variety of
tokenization methods and evaluates similarity of strings with respect
to the similarity of their sets. A variety of set similarity functions can
be used for that purpose, all of which have a similar flavor: Each set
element (or token) is assigned a weight and the similarity between the
sets is computed as a weighted function of the tokens in the intersection
and/or the complement of the intersection of the sets. The application
characteristics heavily influence, first, the tokens to extract, second, the
token weights, and third, the type of similarity function used.
Strings are modeled as sequences of characters, nevertheless a string
can also be represented as a set or a multi-set of tokens, based on
the tokenization function used. Let Λ be a token universe. Let s =
λ 1 λ 2 ···
λ s m i
Λ be a string consisting of a sequence of m tokens. This
definition is a generalization of the definition of strings as a sequence of
characters (recall that s = σ 1
σ ). If the chosen tokenization returns
each character in the string as an individual token, then the two def-
initions are the same, and m = =
···
|
(in other words, the number of tokens in s is not always equal to the
number of characters in s ). In the rest, for clarity we will always refer
to the number of characters in s with
|
s
|
. In general, notice that m
=
|
s
|
s
|
and the number of tokens in
s with
0 (i.e., the L 0 -norm of set s ), irrespective of the context of
the similarity function and the specific tokenization used.
Implicitly, each token in the sequence above is assigned a posi-
tion in the string. A more explicit way of representing the string
s
Search WWH ::




Custom Search