The most commonly used weighting scheme for text processing is
based on inverse document frequency weights. Given a set of strings S
and a universe of tokens Λ, the document frequency df(λ) of a token
λ ∈ Λ is the number of strings s ∈ S that have at least one occurrence
of λ. The inverse document frequency weight idf(λ) is defined as follows.
Definition 2.11 (Inverse Document Frequency Weight). Let S
denote a collection of strings and df(λ), λ ∈ Λ, the number of strings
s ∈ S with at least one occurrence of λ. The inverse document frequency
weight of λ is defined as:
\[ \mathrm{idf}(\lambda) = \log\left(1 + \frac{|S|}{\mathrm{df}(\lambda)}\right). \]
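As a minimal sketch of Definition 2.11, the following Python computes idf
weights for a small collection. Whitespace tokenization, the natural
logarithm, the sample strings, and the name idf_weights are assumptions made
for illustration; the definition itself fixes neither the tokenizer nor the
base of the logarithm.

```python
import math
from collections import Counter


def idf_weights(strings):
    """Return {token: log(1 + |S| / df(token))} for a list of strings.

    Assumes whitespace tokenization and the natural logarithm.
    """
    n = len(strings)
    df = Counter()
    for s in strings:
        # Count each token once per string: df counts strings, not occurrences.
        df.update(set(s.split()))
    return {tok: math.log(1 + n / count) for tok, count in df.items()}


# Hypothetical sample collection S.
S = ["main street", "main road", "main avenue", "park avenue"]
weights = idf_weights(S)

# Print tokens from lowest to highest weight.
for tok in sorted(weights, key=weights.get):
    print(f"{tok:8s} {weights[tok]:.3f}")
```

In this collection "main" occurs in three of the four strings and receives
the smallest weight, while tokens that occur in a single string receive the
largest.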
Alternative definitions of idf weights are also possible. Nevertheless,
they all have a similar flavor. The idf weight is related to the likelihood
that a given token λ appears in a random string s ∈ S. Very frequent
tokens have a high likelihood of appearing in every string, hence they
are assigned small weights. On the other hand, very infrequent tokens
have a very small likelihood of appearing in any string, hence they are
assigned very large weights. The intuition is that two strings that share
a few infrequent tokens must have a large degree of similarity.
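This intuition can be illustrated with a small sketch. It assumes one simple
scoring choice, the sum of the idf weights of the tokens that two strings
share; that score, the whitespace tokenizer, the sample strings, and the
helper names are illustrative assumptions rather than functions defined in
the text.

```python
import math
from collections import Counter


def idf_weights(strings):
    """idf(token) = log(1 + |S| / df(token)), natural log, whitespace tokens."""
    n = len(strings)
    df = Counter()
    for s in strings:
        df.update(set(s.split()))  # each string counts once per token
    return {tok: math.log(1 + n / c) for tok, c in df.items()}


def shared_weight(a, b, weights):
    """Sum of idf weights over the tokens common to strings a and b."""
    common = set(a.split()) & set(b.split())
    return sum(weights[t] for t in common)


# Hypothetical sample collection S.
S = [
    "the house of cards",
    "the lord of rings",
    "the art of war",
    "quantum chromodynamics lecture notes",
    "quantum chromodynamics problem set",
]
w = idf_weights(S)

frequent = shared_weight(S[0], S[1], w)  # shares "the" and "of"
rare = shared_weight(S[3], S[4], w)      # shares "quantum" and "chromodynamics"
print(frequent, rare)
```

Both pairs share exactly two tokens, but the pair that shares the rarer
tokens obtains the higher score.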
Custom weighting schemes are more appropriate in other applications.
One example is when expert knowledge is available regarding the
importance of specific tokens (e.g., in biological sequences). Another
example is deriving weights according to various language models.
A problem with using set-based similarity functions is that by
evaluating the similarity between sets of tokens we lose the ability to
identify spelling mistakes and inconsistencies on a sub-token level.
Very similar tokens belonging to different strings are always treated
as a mismatch by all aforementioned similarity functions. One way to
alleviate this problem is to tokenize strings into overlapping tokens,
as will be discussed in more detail in Section 3. To alleviate some of
the problems associated with edit-based and set-based similarity
functions, hybrid similarity functions based on combinations thereof
have also been considered. A combination similarity function is derived
by defining the
 