String Similarity Functions - Approximate String Processing


The most commonly used weighing scheme for text processing is based on inverse document frequency weights. Given a set of strings S and a universe of tokens Λ, the document frequency df(λ), λ ∈ Λ, is the number of strings s ∈ S that have at least one occurrence of λ. The inverse document frequency weight idf(λ) is defined as follows.

Definition 2.11 (Inverse Document Frequency Weight). Let S denote a collection of strings and df(λ), λ ∈ Λ, the number of strings s ∈ S with at least one occurrence of λ. The inverse document frequency weight of λ is defined as:

$$\mathrm{idf}(\lambda) = \log\left(1 + \frac{|S|}{\mathrm{df}(\lambda)}\right).$$
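As a minimal sketch of Definition 2.11, the following Python computes df(λ) and idf(λ) over a small collection. The whitespace tokenizer, the natural logarithm (the definition does not fix a base), the function names, and the sample strings are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(s):
    # Assumption: lowercase, whitespace-separated words serve as tokens.
    return s.lower().split()


def document_frequencies(strings):
    # df(token) = number of strings containing the token at least once,
    # so each string contributes its *set* of tokens.
    df = Counter()
    for s in strings:
        df.update(set(tokenize(s)))
    return df


def idf_weights(strings):
    # idf(token) = log(1 + |S| / df(token)); natural log is an assumption.
    n = len(strings)
    df = document_frequencies(strings)
    return {tok: math.log(1 + n / count) for tok, count in df.items()}


# Hypothetical sample collection S.
S = ["main street", "main st", "maine street", "park avenue main"]
weights = idf_weights(S)

for tok in sorted(weights, key=weights.get):
    print(f"{tok:8s} {weights[tok]:.3f}")
```

In this sample, "main" occurs in three of the four strings and receives the smallest weight, while tokens that occur in a single string receive the largest.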

Alternative definitions of idf weights are also possible. Nevertheless, they all have a similar flavor. The idf weight is related to the likelihood that a given token λ appears in a random string s ∈ S. Very frequent tokens have a high likelihood of appearing in every string, hence they are assigned small weights. On the other hand, very infrequent tokens have a very small likelihood of appearing in any string, hence they are assigned very large weights. The intuition is that two strings that share a few infrequent tokens must have a large degree of similarity.
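This intuition can be checked numerically. The sketch below assumes an idf-weighted Jaccard similarity (one common set-based function, not necessarily the exact form used elsewhere in this text), natural logarithms, and a hypothetical collection of token sets. Two pairs have the same unweighted Jaccard similarity, but the pair sharing the rarer token scores higher once idf weights are applied.

```python
import math
from collections import Counter


def idf_weights(token_sets):
    # idf(token) = log(1 + |S| / df(token)); each set counts a token once.
    n = len(token_sets)
    df = Counter(tok for toks in token_sets for tok in toks)
    return {tok: math.log(1 + n / c) for tok, c in df.items()}


def jaccard(a, b):
    # Unweighted Jaccard: |a ∩ b| / |a ∪ b|.
    return len(a & b) / len(a | b) if a | b else 0.0


def weighted_jaccard(a, b, w):
    # Assumed weighted variant: sum of weights over the intersection
    # divided by sum of weights over the union.
    union = sum(w[t] for t in a | b)
    return sum(w[t] for t in a & b) / union if union else 0.0


# Hypothetical collection: "corp" is frequent, "zyxel" is comparatively rare.
collection = [
    {"zyxel", "corp"},
    {"zyxel", "ltd"},
    {"acme", "corp"},
    {"bolt", "corp"},
    {"nova", "corp"},
]
w = idf_weights(collection)

rare_pair = (collection[0], collection[1])      # share only "zyxel"
frequent_pair = (collection[2], collection[3])  # share only "corp"

print(jaccard(*rare_pair), jaccard(*frequent_pair))
print(weighted_jaccard(*rare_pair, w), weighted_jaccard(*frequent_pair, w))
```

Both pairs share one token out of three distinct tokens, so the unweighted scores are equal; the weighted score separates them because the shared token "zyxel" carries more weight than "corp".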

Custom weighing schemes are more appropriate in other applications. A good example is the availability of expert knowledge regarding the importance of specific tokens (e.g., in biological sequences). Another example is deriving weights according to various language models.

A problem with using set based similarity functions is that by evaluating the similarity between sets of tokens we lose the ability to identify spelling mistakes and inconsistencies on a sub-token level. Very similar tokens belonging to different strings are always considered as a mismatch by all aforementioned similarity functions. One way to alleviate this problem is to tokenize strings into overlapping tokens, as will be discussed in more detail in Section 3. To alleviate some of the problems associated with edit based and set based similarity functions, hybrid similarity functions based on combinations thereof have also been considered. A combination similarity function is derived by defining the
