Notice that this definition implicitly treats frequency-set intersection
as intersection on bags of tokens, by virtue of the min operation. In
other words, a token is included in the intersection as long as it
appears in both frequency-sets; the two frequency-sets do not have to
agree on the exact frequency of that token.
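To make the bag semantics concrete, here is a minimal Python sketch. It assumes that every shared occurrence of a token contributes that token's weight once; the frequency-set definition itself is not restated in this section, so that weighting rule is an assumption, and the function name `bag_weighted_intersection` and the unit weight function are hypothetical.

```python
from collections import Counter

def bag_weighted_intersection(s_tokens, r_tokens, weight):
    """Sum token weights over the bag (multiset) intersection of two token lists."""
    s_freq = Counter(s_tokens)
    r_freq = Counter(r_tokens)
    # Counter & Counter keeps each common token with the minimum of its two counts.
    common = s_freq & r_freq
    # Assumption: each shared occurrence contributes weight(token) once.
    return sum(weight(tok) * count for tok, count in common.items())

# "ab" occurs 3 times in s and 2 times in r, so it is counted min(3, 2) = 2 times.
s = ["ab", "ab", "ab", "bc"]
r = ["ab", "ab", "cd"]
result = bag_weighted_intersection(s, r, lambda tok: 1.0)
```

With unit weights, `result` is 2.0: the token "ab" is in the intersection even though the two frequencies differ.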
Finally, the definition can be modified to disregard token positions
and multiplicity.
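Definition 2.5 (Weighted Intersection on Sets). Let
$s = \{\lambda^s_1, \ldots, \lambda^s_m\}$,
$r = \{\lambda^r_1, \ldots, \lambda^r_n\}$,
$\lambda^s_i, \lambda^r_i \in \Lambda$, be two sets of tokens. The weighted
intersection of $s$ and $r$ is defined as
$$I(s, r) = \sum_{\lambda \in s \cap r} W(\lambda).$$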
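As an illustration of Definition 2.5, the following Python sketch sums the weights of the tokens that two sets share. The weight table and the example tokens are made up for illustration, and the function name `weighted_set_intersection` is hypothetical.

```python
def weighted_set_intersection(s, r, weight):
    """I(s, r): sum of W(token) over tokens present in both sets."""
    # Converting to sets discards token positions and multiplicity.
    return sum(weight[tok] for tok in set(s) & set(r))

# Hypothetical weight function W, given as a lookup table.
weights = {"main": 0.5, "st": 0.25, "maine": 2.0}

s = {"main", "st", "maine"}
r = {"main", "st"}
set_result = weighted_set_intersection(s, r, weights)
```

Here the shared tokens are "main" and "st", so `set_result` is 0.5 + 0.25 = 0.75. The token "maine", which appears only in `s`, has no effect on the value.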
Whether to represent strings as sequences, frequency-sets, or sets of
tokens is application dependent, and the choice applies to all similarity
functions introduced below. In what follows we refer to strings as
sequences, which is the most general case. Extending the definitions to
frequency-sets and sets is straightforward.
Notice that the definitions above do not take into account the weight
or number of tokens that the two strings do not have in common (i.e.,
in the complement of their intersection). In certain applications, it is
required for strings to have similar lengths (either similar number of
tokens or similar total token weight). One could use various forms of
normalization to address this issue. A simple technique is to divide the
weighted intersection by the maximum sequence weight.
Definition 2.6 (Normalized Weighted Intersection). Let
$s = \lambda^s_1 \cdots \lambda^s_m$, $r = \lambda^r_1 \cdots \lambda^r_n$
be two sequences of tokens. The normalized weighted intersection of $s$
and $r$ is defined as
$$N(s, r) = \frac{\|s \cap r\|_1}{\max(\|s\|_1, \|r\|_1)},$$
where $\|s\|_1 = \sum_{i=1}^{\|s\|_0} W(\lambda^s_i)$ (i.e., the
$L_1$-norm of token sequence $s$).
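The effect of the normalization in Definition 2.6 can be sketched in Python. For simplicity this sketch uses the set representation of strings rather than sequences, so positions and multiplicity are ignored; that simplification, the weight table, the handling of empty inputs, and the function names are all assumptions made for illustration.

```python
def l1_norm(tokens, weight):
    """Total weight of a collection of tokens (its L1-norm)."""
    return sum(weight[tok] for tok in tokens)

def normalized_weighted_intersection(s, r, weight):
    """N(s, r): weight of the shared tokens divided by the larger total weight."""
    s, r = set(s), set(r)  # assumption: set representation, not sequences
    denom = max(l1_norm(s, weight), l1_norm(r, weight))
    if denom == 0:
        # Assumption: two empty (or zero-weight) strings get similarity 0.
        return 0.0
    return l1_norm(s & r, weight) / denom

# Hypothetical weight function W, given as a lookup table.
weights = {"a": 1.0, "b": 1.0, "c": 2.0}

s = {"a", "b"}       # total weight 2.0
r = {"a", "b", "c"}  # total weight 4.0
n_sr = normalized_weighted_intersection(s, r, weights)
n_ss = normalized_weighted_intersection(s, s, weights)
```

The shared tokens weigh 2.0 in both comparisons, but `n_sr` is 2.0 / 4.0 = 0.5 while `n_ss` is 1.0. The unshared token "c" now lowers the score, which the unnormalized intersection cannot express.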
 