Definition 2.3 (Weighted Intersection on Sequences). Let
$s = \{(\lambda^s_1, 1), \ldots, (\lambda^s_m, m)\}$, $r = \{(\lambda^r_1, 1), \ldots, (\lambda^r_n, n)\}$,
$\lambda^s_i, \lambda^r_i \in \Lambda$, be two sequences of tokens. The weighted
intersection of $s$ and $r$ is defined as
$$I(s, r) = \sum_{(\lambda, p) \in s \cap r} W(\lambda).$$
The intuition here is simple. If two strings share enough heavy
token/position pairs, or a large number of light token/position pairs,
then the strings are potentially very similar. Clearly, intersection is a
symmetric similarity measure. This definition uses the absolute position
of tokens within the sequences; if a common token does not
appear in exactly the same position in the two strings, then it is
not included in the intersection. For example, with words as tokens, the
two sequences 'The Bill & Melinda Gates Foundation' and 'The Melinda &
Bill Gates Foundation' have only four token/position pairs in common
('The', '&', 'Gates' and 'Foundation'). One can extend
the definition to consider edit-based similarity, where the distance in
the position of a common token between two strings is allowed to be
within user-specified bounds, instead of requiring it to be exactly the
same.
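As a rough illustration of Definition 2.3, the following minimal sketch computes the weighted intersection of two token sequences. It assumes whitespace-separated word tokens, and the weight table `W` holds made-up values chosen only for this example; the function name `weighted_intersection_seq` is likewise hypothetical.

```python
def weighted_intersection_seq(s, r, W):
    """Sum W[token] over the (token, position) pairs found in both s and r."""
    # Positions are 1-based, matching the definition.
    pairs_s = {(tok, pos) for pos, tok in enumerate(s, start=1)}
    pairs_r = {(tok, pos) for pos, tok in enumerate(r, start=1)}
    return sum(W[tok] for tok, _ in pairs_s & pairs_r)


# Word tokens obtained by splitting on whitespace (an assumption).
s = "The Bill & Melinda Gates Foundation".split()
r = "The Melinda & Bill Gates Foundation".split()

# Hypothetical weights: rarer words are made "heavier".
W = {"The": 1, "&": 1, "Bill": 3, "Melinda": 3, "Gates": 3, "Foundation": 2}
unit = dict.fromkeys(W, 1)  # every token weighs 1

print(weighted_intersection_seq(s, r, unit))  # 4 shared token/position pairs
print(weighted_intersection_seq(s, r, W))     # 1 + 1 + 3 + 2 = 7
```

'Bill' and 'Melinda' contribute nothing here, because each appears at a different position in the two sequences.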
Depending on application characteristics, it might be important to
consider similarity of strings without regard for token positions. This
is the case, for example, when tokens are words and word order is not
important (consider 'The Bill & Melinda Gates Foundation' and 'The
Melinda & Bill Gates Foundation'), or when one is interested in
evaluating similarity on substrings (consider 'American Red Cross' and
'Red Cross'). For that purpose, weighted intersection can be defined on
frequency-sets.
Definition 2.4 (Weighted Intersection on Frequency-sets). Let
$s = \{(\lambda^s_1, f(\lambda^s_1)), \ldots, (\lambda^s_m, f(\lambda^s_m))\}$,
$r = \{(\lambda^r_1, f(\lambda^r_1)), \ldots, (\lambda^r_n, f(\lambda^r_n))\}$,
$\lambda^s_i, \lambda^r_i \in \Lambda$, be two frequency-sets of tokens. The
weighted intersection of $s$ and $r$ is defined as
$$I(s, r) = \sum_{\lambda \in s \cap r} \min(f_s(\lambda), f_r(\lambda)) \, W(\lambda).$$
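Definition 2.4 can be sketched in the same way. This is a minimal example that assumes word tokens and uses Python's `collections.Counter` to build the frequency-sets; the function name `weighted_intersection_freq` and the weight tables are hypothetical and serve only as illustration.

```python
from collections import Counter


def weighted_intersection_freq(s, r, W):
    """Sum min(f_s(token), f_r(token)) * W[token] over tokens common to s and r."""
    f_s, f_r = Counter(s), Counter(r)  # token -> frequency
    common = f_s.keys() & f_r.keys()
    return sum(min(f_s[tok], f_r[tok]) * W[tok] for tok in common)


# Word order no longer matters: all six tokens now contribute.
s = "The Bill & Melinda Gates Foundation".split()
r = "The Melinda & Bill Gates Foundation".split()
W = {"The": 1, "&": 1, "Bill": 3, "Melinda": 3, "Gates": 3, "Foundation": 2}  # made-up weights
print(weighted_intersection_freq(s, r, W))  # 1 + 1 + 3 + 3 + 3 + 2 = 13

# Substring case: 'Red' and 'Cross' match regardless of position.
a = "American Red Cross".split()
b = "Red Cross".split()
W2 = {"American": 1, "Red": 2, "Cross": 2}  # made-up weights
print(weighted_intersection_freq(a, b, W2))  # 2 + 2 = 4
```

Taking the minimum of the two frequencies means a token repeated in only one of the strings is counted just as many times as it can be matched in the other.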