Databases Reference
In-Depth Information
2. d(x, y) = 0 if x = y, because x∪x = x∩x = x. However, if x = y, then
the size of x∩y is strictly less than the size of x∪y, so d(x, y) is strictly
positive.
3. d(x, y) = d(y, x) because both union and intersection are symmetric; i.e.,
x∪y = y∪x and x∩y = y∩x.
4. For the triangle inequality, recall from Section 3.3.3 that SIM (x, y) is the
probability a random minhash function maps x and y to the same value.
Thus, the Jaccard distance d(x, y) is the probability that a random min-
hash function does not send x and y to the same value. We can therefore
translate the condition d(x, y)≤d(x, z) + d(z, y) to the statement that if
h is a random minhash function, then the probability that h(x) = h(y)
is no greater than the sum of the probability that h(x) = h(z) and the
probability that h(z) = h(y). However, this statement is true because
whenever h(x) = h(y), at least one of h(x) and h(y) must be different
from h(z). They could not both be h(z), because then h(x) and h(y)
would be the same.
.
3.5.4 Cosine Distance
The cosine distance makes sense in spaces that have dimensions, including Eu-
clidean spaces and discrete versions of Euclidean spaces, such as spaces where
points are vectors with integer components or boolean (0 or 1) components. In
such a space, points may be thought of as directions. We do not distinguish be-
tween a vector and a multiple of that vector. Then the cosine distance between
two points is the angle that the vectors to those points make. This angle will
be in the range 0 to 180 degrees, regardless of how many dimensions the space
has.
We can calculate the cosine distance by first computing the cosine of the
angle, and then applying the arc-cosine function to translate to an angle in the
0-180 degree range. Given two vectors x and y, the cosine of the angle between
them is the dot product x.y divided by the L 2 -norms of x and y (i.e., their
Euclidean distances from the origin).
Recall that the dot product of vectors
n
i=1 x i y i .
[x 1 , x 2 , . . . , x n ].[y 1 , y 2 , . . . , y n ] is
Example 3.13 : Let our two vectors be x = [1, 2,−1] and = [2, 1, 1]. The dot
product x.y is 1×2 + 2×1 + (−1)×1 = 3. The L 2 -norm of both vectors is
6. For example, x has L 2 -norm
1 2 + 2 2 + (−1) 2
=
6. Thus, the cosine of
the angle between x and y is 3/(
6) or 1/2. The angle whose cosine is 1/2
is 60 degrees, so that is the cosine distance between x and y.
6
2
We must show that the cosine distance is indeed a distance measure. We
have defined it so the values are in the range 0 to 180, so no negative distances
Search WWH ::




Custom Search