similarity function, we have
\[
\mathrm{sim}(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}, \qquad (2.23)
\]
where $\|x\|$ is the Euclidean norm of vector $x = (x_1, x_2, \ldots, x_p)$, defined as
$\sqrt{x_1^2 + x_2^2 + \cdots + x_p^2}$. Conceptually, it is the length of the vector. Similarly, $\|y\|$ is the
Euclidean norm of vector $y$. The measure computes the cosine of the angle between vectors
$x$ and $y$. A cosine value of 0 means that the two vectors are at 90 degrees to each
other (orthogonal) and have no match. The closer the cosine value to 1, the smaller the
angle and the greater the match between vectors. Note that because the cosine similarity
measure does not obey all of the properties of Section 2.4.4 defining metric measures, it
is referred to as a nonmetric measure.
Example 2.23 Cosine similarity between two term-frequency vectors. Suppose that $x$ and $y$ are the
first two term-frequency vectors in Table 2.5. That is, $x = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)$
and $y = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)$. How similar are $x$ and $y$? Using Eq. (2.23) to compute the
cosine similarity between the two vectors, we get:

\[
\begin{aligned}
x^t \cdot y &= 5 \times 3 + 0 \times 0 + 3 \times 2 + 0 \times 0 + 2 \times 1 + 0 \times 1 + 0 \times 0 + 2 \times 1 + 0 \times 0 + 0 \times 1 = 25 \\
\|x\| &= \sqrt{5^2 + 0^2 + 3^2 + 0^2 + 2^2 + 0^2 + 0^2 + 2^2 + 0^2 + 0^2} = 6.48 \\
\|y\| &= \sqrt{3^2 + 0^2 + 2^2 + 0^2 + 1^2 + 1^2 + 0^2 + 1^2 + 0^2 + 1^2} = 4.12 \\
\mathrm{sim}(x, y) &= 0.94
\end{aligned}
\]

Therefore, if we were using the cosine similarity measure to compare these documents,
they would be considered quite similar.
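
The arithmetic of Example 2.23 can be checked with a few lines of Python. This is only a minimal sketch of Eq. (2.23) using the standard library; the function name `cosine_similarity` and the choice to raise an error for a zero vector are assumptions made for illustration, not something the text prescribes.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors, as in Eq. (2.23)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))        # x . y
    norm_x = math.sqrt(sum(a * a for a in x))     # ||x||
    norm_y = math.sqrt(sum(b * b for b in y))     # ||y||
    if norm_x == 0 or norm_y == 0:
        # Assumption: treat the zero vector as an error, since the angle is undefined.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


# The two term-frequency vectors from Example 2.23.
x = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
y = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]

print(round(cosine_similarity(x, y), 2))  # 0.94
```

Orthogonal vectors such as `[1, 0]` and `[0, 1]` give 0 under the same function, matching the "no match" case described above.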
When attributes are binary-valued, the cosine similarity function can be interpreted
in terms of shared features or attributes. Suppose an object $x$ possesses the $i$th attribute
if $x_i = 1$. Then $x^t \cdot y$ is the number of attributes possessed (i.e., shared) by both $x$ and
$y$, and $\|x\| \, \|y\|$ is the geometric mean of the number of attributes possessed by $x$ and the
number possessed by $y$. Thus, $\mathrm{sim}(x, y)$ is a measure of relative possession of common
attributes.
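
The binary interpretation can be illustrated by counting attributes directly instead of computing norms. The sketch below assumes 0/1 vectors; the function name `binary_cosine` and the two sample vectors are hypothetical and chosen only for illustration.

```python
import math


def binary_cosine(x, y):
    """Cosine similarity for 0/1 vectors, written as shared count / geometric mean."""
    shared = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # attributes held by both
    n_x = sum(x)  # number of attributes possessed by x
    n_y = sum(y)  # number of attributes possessed by y
    if n_x == 0 or n_y == 0:
        # Assumption: an object with no attributes has no defined similarity.
        raise ValueError("undefined when an object possesses no attributes")
    return shared / math.sqrt(n_x * n_y)  # sqrt(n_x * n_y) is the geometric mean


# Hypothetical objects described by five binary attributes.
x = [1, 1, 0, 1, 0]
y = [1, 0, 0, 1, 1]

# Two shared attributes, three attributes each: 2 / sqrt(3 * 3) = 2/3.
print(binary_cosine(x, y))
```

For binary vectors $\|x\|^2$ equals the number of attributes possessed by $x$, which is why the product of the norms reduces to the geometric mean of the two counts.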
A simple variation of cosine similarity for the preceding scenario is

\[
\mathrm{sim}(x, y) = \frac{x \cdot y}{x \cdot x + y \cdot y - x \cdot y}, \qquad (2.24)
\]

which is the ratio of the number of attributes shared by $x$ and $y$ to the number of
attributes possessed by $x$ or $y$. This function, known as the Tanimoto coefficient or
Tanimoto distance, is frequently used in information retrieval and biology taxonomy.
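
Eq. (2.24) can be sketched in the same style. The function name `tanimoto`, the helper `dot`, and the sample vectors below are assumptions for illustration; the check against a set-based count shows the "shared over possessed by either" reading for binary data.

```python
def dot(u, v):
    """Inner product of two equal-length numeric vectors."""
    return sum(a * b for a, b in zip(u, v))


def tanimoto(x, y):
    """Tanimoto coefficient as in Eq. (2.24): x.y / (x.x + y.y - x.y)."""
    xy = dot(x, y)
    denom = dot(x, x) + dot(y, y) - xy
    if denom == 0:
        # Assumption: both vectors are all zeros, so the ratio is undefined.
        raise ValueError("undefined when both vectors are zero")
    return xy / denom


# Hypothetical objects described by five binary attributes.
x = [1, 1, 0, 1, 0]
y = [1, 0, 0, 1, 1]

# Same quantity computed from attribute index sets: |shared| / |possessed by x or y|.
set_x = {i for i, v in enumerate(x) if v == 1}
set_y = {i for i, v in enumerate(y) if v == 1}

print(tanimoto(x, y))                             # 0.5
print(len(set_x & set_y) / len(set_x | set_y))    # 0.5
```

Here two attributes are shared and four are possessed by at least one object, so both computations give 2/4.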
 