Database Reference
In-Depth Information
There are other distance measures that have been used for Euclidean spaces. For any
constant
r
, we can define the
L
r
-norm
to be the distance measure
d
defined by:
The case
r
= 2 is the usual
L
2
-norm just mentioned. Another common distance measure is
the
L
1
-norm, or
Manhattan distance
. There, the distance between two points is the sum of
the magnitudes of the differences in each dimension. It is called “Manhattan distance” be-
cause it is the distance one would have to travel between points if one were constrained to
travel along grid lines, as on the streets of a city such as Manhattan.
Another interesting distance measure is the
L
∞
-norm, which is the limit as
r
approaches
infinity of the
L
r
-norm. As
r
gets larger, only the dimension with the largest difference mat-
ters, so formally, the
L
∞
-norm is defined as the maximum of |
x
i
−
y
i
| over all dimensions
i
.
EXAMPLE
3.12 Consider the two-dimensional Euclidean space (the customary plane) and
the points (2
,
7) and (6
,
4). The
L
2
-norm gives a distance of The
L
1
-norm
gives a distance of |2 − 6| + |7 − 4| = 4 + 3 = 7. The
L
∞
-norm gives a distance of
max(|2 − 6|
,
|7 − 4|) = max(4
,
3) = 4
.
□
3.5.3
Jaccard Distance
As mentioned at the beginning of the section, we define the
Jaccard distance
of sets by
d
(
x,
y
) = 1 − SIM(
x, y
). That is, the Jaccard distance is 1 minus the ratio of the sizes of the inter-
section and union of sets
x
and
y
. We must verify that this function is a distance measure.
(1)
d
(
x, y
) is nonnegative because the size of the intersection cannot exceed the size of the
union.
(2)
d
(
x, y
) = 0 if
x
=
y
, because
x
∪
x
=
x
∩
x
=
x
. However, if
x
≠
y
, then the size of
x
∩
y
is strictly less than the size of
x
∪
y
, so
d
(
x, y
) is strictly positive.
(3)
d
(
x, y
) =
d
(
y, x
) because both union and intersection are symmetric; i.e.,
x
∪
y
=
y
∪
x
and
x
∩
y
=
y
∩
x
.
(4) For the triangle inequality, recall from
Section 3.3.3
that SIM(
x, y
) is the probability a
random minhash function maps
x
and
y
to the same value. Thus, the Jaccard distance
d
(
x, y
) is the probability that a random minhash function
does not
send
x
and
y
to the
same value. We can therefore translate the condition
d
(
x, y
) ≤
d
(
x, z
)+
d
(
z, y
) to the
statement that if
h
is a random minhash function, then the probability that
h
(
x
) ≠
h
(
y
)
is no greater than the sum of the probability that
h
(
x
) ≠
h
(
z
) and the probability that