Another extension of weighted intersection that takes into account
the total weight of each token sequence is the cosine similarity.
Cosine similarity is the normalized inner product of two vectors. Each
token sequence can be represented conceptually as a vector in the
high-dimensional space defined by the token universe Λ (or the
cross-product Λ × N for token/position pairs). For example, a set s can
be represented as a vector of dimensionality |Λ|, where the coordinate
corresponding to token λ is W(λ) if λ ∈ s and zero otherwise. This
representation is commonly referred to as the vector space model.
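As an illustration of the vector space model, the following minimal Python sketch maps a token set to a weight vector over a small token universe. The function name to_weight_vector, the example universe of 2-grams, and the weight values are hypothetical and chosen only for illustration.

```python
def to_weight_vector(tokens, universe, weights):
    """Map a token set to a vector with one coordinate per token in `universe`.

    The coordinate for token t is weights[t] if t occurs in `tokens`,
    and zero otherwise.
    """
    present = set(tokens)
    return [weights[t] if t in present else 0.0 for t in universe]


# Hypothetical token universe (a few 2-grams) with an assumed weight function W.
universe = ["an", "ap", "le", "pl", "pp"]
weights = {t: 1.0 for t in universe}
weights["pp"] = 2.0  # assume "pp" is more informative than the other tokens

vec = to_weight_vector({"ap", "pp", "pl", "le"}, universe, weights)
print(vec)  # [0.0, 1.0, 1.0, 1.0, 2.0]
```

In practice the token universe is far too large to materialize such dense vectors, which is why the representation is described as conceptual; the sketch only makes the coordinates explicit.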
Definition 2.10 (Cosine Similarity). Let $s = \lambda_1^s \cdots \lambda_m^s$,
$r = \lambda_1^r \cdots \lambda_n^r$ be two sequences of tokens. The cosine
similarity of s and r is defined as

$$C(s, r) = \frac{\left(\|s \cap r\|_2\right)^2}{\|s\|_2 \, \|r\|_2},$$

where $\|s\|_2 = \sqrt{\sum_{i=1}^{|s|} W(\lambda_i^s)^2}$ (i.e., the L2-norm of token sequence s).
Cosine similarity is maximized if and only if the two sequences are the
same.
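A minimal Python sketch of cosine similarity over token sets follows. It assumes set semantics (each token counted once), reads the definition as the squared L2-norm of the intersection divided by the product of the two L2-norms, and takes the weight function W as a caller-supplied callable; the names squared_weight and cosine_similarity and the example tokens are hypothetical.

```python
import math


def squared_weight(tokens, weight):
    """Sum of squared token weights, i.e. the squared L2-norm of a token set."""
    return sum(weight(t) ** 2 for t in tokens)


def cosine_similarity(s, r, weight=lambda token: 1.0):
    """Cosine similarity of two token sets under the weight function `weight`.

    Defaults to unit weights. Returns 0.0 if either set has zero total weight.
    """
    s, r = set(s), set(r)
    denom = math.sqrt(squared_weight(s, weight) * squared_weight(r, weight))
    if denom == 0.0:
        return 0.0
    return squared_weight(s & r, weight) / denom


s = {"ap", "pp", "pl", "le"}
r = {"ap", "pl", "le", "es"}
print(cosine_similarity(s, r))  # 0.75 with unit weights: 3 / sqrt(4 * 4)
print(cosine_similarity(s, s))  # 1.0, the maximum
```

With strictly positive weights, the value 1 is reached only when the intersection carries the full weight of both sets, which is the case exactly when the two sets are equal.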
Clearly, Normalized Weighted Intersection, Jaccard, Dice, and
Cosine similarity are strongly related in the sense that they all
normalize the similarity with respect to the weight of the token
sequences. Hence, there is no functional difference between those
similarity functions. The only difference is semantic, and which
function works best depends heavily on application and data
characteristics.
An important consideration for weighted similarity functions is the
token weighting scheme used. The simplest weighting scheme is to use
unit weights for all tokens. A more meaningful weighting scheme, though,
should assign larger weights to tokens that carry more information
content. As usual, the information content of a token is application and
data dependent. For example, a specific sequence of characters might be
a very rare word in the English language but a very frequent occurrence
in non-coding DNA sequences, or a common sequence of phonemes in
Greek. Hence, a variety of weighting schemes have been designed, with
a variety of application domains in mind.
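To make the contrast between weighting schemes concrete, the sketch below compares unit weights with an inverse-frequency weight computed from a small collection of token sets. The passage above does not name a specific scheme, so the inverse-frequency formula used here, log(1 + N / df), is an assumption chosen as one common variant; the function names and the example collection are likewise hypothetical.

```python
import math
from collections import Counter


def unit_weights(corpus):
    """Assign weight 1.0 to every token that appears in the corpus."""
    return {t: 1.0 for tokens in corpus for t in tokens}


def inverse_frequency_weights(corpus):
    """Assign larger weights to tokens that occur in fewer token sets.

    Uses log(1 + N / df), where N is the number of token sets and df is
    the number of sets containing the token. This formula is an assumed,
    illustrative choice, not one prescribed by the text.
    """
    n = len(corpus)
    df = Counter(t for tokens in corpus for t in set(tokens))
    return {t: math.log(1.0 + n / count) for t, count in df.items()}


# Hypothetical collection of tokenized street names.
corpus = [{"main", "st"}, {"park", "st"}, {"elm", "st"}, {"main", "ave"}]
w = inverse_frequency_weights(corpus)
for token in sorted(w, key=w.get, reverse=True):
    print(f"{token}: {w[token]:.3f}")
```

Under this scheme the token "st", which appears in three of the four sets, receives the smallest weight, while tokens that appear only once receive the largest. The same token could rank very differently in another collection, which is the data dependence described above.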
 