as a sequence of tokens is to use a set of token/position pairs
$s = \{(\lambda^s_1, 1), \ldots, (\lambda^s_m, m)\}$. A weaker string
representation is to sacrifice the positional information and only
preserve the number of times each token appears in the string.
Definition 2.2 (Token Frequency). The token frequency $f_s(\lambda)$ is
the number of occurrences of token $\lambda$ in string $s$. When clear
from context, we simply write $f(\lambda)$ to refer to the token
frequency of $\lambda$ in a certain string.
Using token frequencies, a string can be represented as the set
$s = \{(\lambda_1, f(\lambda_1)), \ldots, (\lambda_n, f(\lambda_n))\}$
(notice that $n \le m$). Here the order of tokens is lost. We refer to
this representation as a frequency-set of tokens. An even weaker
representation is to discard the token frequency and consider strings
as simple sets of tokens, i.e., $s = \{\lambda_1, \ldots, \lambda_n\}$.
We differentiate between these three string representations as
sequences, frequency-sets, and sets.
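The three representations can be illustrated with a short Python sketch. It assumes whitespace tokenization, which the text does not prescribe, and the helper names (`as_sequence`, `as_frequency_set`, `as_set`) are hypothetical, chosen only for this example.

```python
from collections import Counter

def as_sequence(tokens):
    """Set of (token, position) pairs; positions start at 1."""
    return {(tok, pos) for pos, tok in enumerate(tokens, start=1)}

def as_frequency_set(tokens):
    """Set of (token, frequency) pairs; token order is lost."""
    return set(Counter(tokens).items())

def as_set(tokens):
    """Plain set of distinct tokens; order and frequency are lost."""
    return set(tokens)

# Assumption: tokens are obtained by splitting on whitespace.
tokens = "to be or not to be".split()

seq = as_sequence(tokens)        # m = 6 token/position pairs
fset = as_frequency_set(tokens)  # n = 4 token/frequency pairs
tset = as_set(tokens)            # n = 4 distinct tokens
```

Here the sequence has m = 6 elements while the frequency-set and the set each have n = 4, consistent with n being at most m.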
It should be stressed here that all three representations are
explicitly defined to be sets of elements (as opposed to multi-sets or
bags). Hence, in what follows, all intersection and union predicates
operate on sets. Notice also that the most general representation of
the three is sequences. When indexing sequences, one can easily
disregard the positional information of the tokens and treat the
strings either as frequency-sets or sets. The particular interpretation
of a string as a sequence, a frequency-set, or a set has a significant
influence on the semantics of the similarity between strings.
Sequences are the strictest interpretation: strings not only have to
agree on a large number of tokens, but those tokens also have to have
similar positions within the strings. Sets are the loosest: the
similarity of strings depends only on the number of tokens shared,
rather than on the position or the multiplicity of those tokens.
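To make the difference in strictness concrete, the following sketch counts the elements two strings share under each interpretation. It reads "intersection on sets" literally: elements are token/position pairs, token/frequency pairs, or bare tokens, and two elements match only if they are equal. That reading, the whitespace tokenization, and the function name `overlap_counts` are assumptions made for illustration.

```python
from collections import Counter

def overlap_counts(a_tokens, b_tokens):
    """Number of shared elements under each of the three interpretations."""
    def seq(tokens):
        return {(t, p) for p, t in enumerate(tokens, start=1)}

    def fset(tokens):
        return set(Counter(tokens).items())

    return {
        "sequence": len(seq(a_tokens) & seq(b_tokens)),
        "frequency_set": len(fset(a_tokens) & fset(b_tokens)),
        "set": len(set(a_tokens) & set(b_tokens)),
    }

# Same tokens, different order: only the sequence view sees a difference.
reordered = overlap_counts("new york city hall".split(),
                           "city hall new york".split())

# A repeated token: the frequency-set view also sees a difference.
repeated = overlap_counts("a a b".split(), "a b".split())
```

In the first pair every token is shared but none at the same position, so the sequence overlap is empty while the other two are complete. In the second pair the extra occurrence of `a` changes its frequency, so only the plain set view treats both tokens as shared.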
Let $W : \Lambda \to \mathbb{R}^+$ be a function that assigns a positive
real value as a weight to each token in $\Lambda$. The simplest function
for evaluating the similarity between two strings is the weighted
intersection of the token sequences.
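As a rough sketch of the idea, the weighted intersection can be computed by summing the weights of the elements two sequences have in common. The formal definition is not part of this passage, so the formula used here (sum of W over shared token/position pairs) is an assumption, as are the function name `weighted_intersection` and the example weights.

```python
def weighted_intersection(s_tokens, r_tokens, weight):
    """Sum of token weights over the token/position pairs shared by s and r.

    Assumption: `weight` maps every token to a positive real number,
    playing the role of W in the text.
    """
    s = {(tok, pos) for pos, tok in enumerate(s_tokens, start=1)}
    r = {(tok, pos) for pos, tok in enumerate(r_tokens, start=1)}
    return sum(weight[tok] for tok, _pos in s & r)

# Hypothetical weights; rarer tokens are given larger values.
weights = {"main": 1.0, "st": 0.5, "street": 0.5, "maine": 2.0}

score = weighted_intersection(
    ["main", "st", "maine"],
    ["main", "street", "maine"],
    weights,
)
```

The two strings share `main` at position 1 and `maine` at position 3, so the score is the sum of those two weights. Swapping the pair construction for plain sets of tokens would give the corresponding measure under the set interpretation.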
 