Database Reference
In-Depth Information
There are other distance measures that have been used for Euclidean spaces. For any
constant r , we can define the L r -norm to be the distance measure d defined by:
The case r = 2 is the usual L 2 -norm just mentioned. Another common distance measure is
the L 1 -norm, or Manhattan distance . There, the distance between two points is the sum of
the magnitudes of the differences in each dimension. It is called “Manhattan distance” be-
cause it is the distance one would have to travel between points if one were constrained to
travel along grid lines, as on the streets of a city such as Manhattan.
Another interesting distance measure is the L -norm, which is the limit as r approaches
infinity of the L r -norm. As r gets larger, only the dimension with the largest difference mat-
ters, so formally, the L -norm is defined as the maximum of | x i y i | over all dimensions i .
EXAMPLE 3.12 Consider the two-dimensional Euclidean space (the customary plane) and
the points (2 , 7) and (6 , 4). The L 2 -norm gives a distance of The L 1 -norm
gives a distance of |2 − 6| + |7 − 4| = 4 + 3 = 7. The L -norm gives a distance of
max(|2 − 6| , |7 − 4|) = max(4 , 3) = 4 .
3.5.3
Jaccard Distance
As mentioned at the beginning of the section, we define the Jaccard distance of sets by d ( x,
y ) = 1 − SIM( x, y ). That is, the Jaccard distance is 1 minus the ratio of the sizes of the inter-
section and union of sets x and y . We must verify that this function is a distance measure.
(1) d ( x, y ) is nonnegative because the size of the intersection cannot exceed the size of the
union.
(2) d ( x, y ) = 0 if x = y , because x x = x x = x . However, if x y , then the size of x y
is strictly less than the size of x y , so d ( x, y ) is strictly positive.
(3) d ( x, y ) = d ( y, x ) because both union and intersection are symmetric; i.e., x y = y
x and x y = y x .
(4) For the triangle inequality, recall from Section 3.3.3 that SIM( x, y ) is the probability a
random minhash function maps x and y to the same value. Thus, the Jaccard distance
d ( x, y ) is the probability that a random minhash function does not send x and y to the
same value. We can therefore translate the condition d ( x, y ) ≤ d ( x, z )+ d ( z, y ) to the
statement that if h is a random minhash function, then the probability that h ( x ) ≠ h ( y )
is no greater than the sum of the probability that h ( x ) ≠ h ( z ) and the probability that
Search WWH ::




Custom Search