Finding Similar Items - Mining of Massive Datasets

Databases Reference

In-Depth Information

It is easy to verify the first three requirements for a distance measure are

satisfied. The Euclidean distance between two points cannot be negative, be-

cause the positive square root is intended. Since all squares of real numbers are

nonnegative, any i such that x i = y i forces the distance to be strictly positive.

On the other hand, if x i = y i for all i, then the distance is clearly 0. Symmetry

follows because (x i

−x i ) 2 . The triangle inequality requires a good

deal of algebra to verify. However, it is well understood to be a property of

Euclidean space: the sum of the lengths of any two sides of a triangle is no less

than the length of the third side.

There are other distance measures that have been used for Euclidean spaces.

For any constant r, we can define the L r -norm to be the distance measure d

defined by:

−y i ) 2

= (y i

n

| r ) 1/r

d([x 1 , x 2 , . . . , x n ], [y 1 , y 2 , . . . , y n ]) = (

|x i

−y i

i=1

The case r = 2 is the usual L 2 -norm just mentioned. Another common distance

measure is the L 1 -norm, or Manhattan distance. There, the distance between

two points is the sum of the magnitudes of the differences in each dimension.

It is called “Manhattan distance” because it is the distance one would have to

travel between points if one were constrained to travel along grid lines, as on

the streets of a city such as Manhattan.

Another interesting distance measure is the L ∞ -norm, which is the limit

as r approaches infinity of the L r -norm. As r gets larger, only the dimension

with the largest difference matters, so formally, the L ∞ -norm is defined as the

maximum of|x i −y i |over all dimensions i.

Example 3.12 : Consider the two-dimensional Euclidean space (the custom-

ary plane) and the points (2, 7) and (6, 4).

The L 2 -norm gives a distance

√

of

4 2 + 3 2 = 5. The L 1 -norm gives a distance of

|2−6|+|7−4|= 4 + 3 = 7. The L ∞ -norm gives a distance of

(2−6) 2 + (7−4) 2

=

max(|2−6|,|7−4|) = max(4, 3) = 4

2

3.5.3

Jaccard Distance

As mentioned at the beginning of the section, we define the Jaccard distance

of sets by d(x, y) = 1− SIM (x, y). That is, the Jaccard distance is 1 minus the

ratio of the sizes of the intersection and union of sets x and y. We must verify

that this function is a distance measure.

1. d(x, y) is nonnegative because the size of the intersection cannot exceed

the size of the union.

Mining of Massive Datasets

Search WWH ::

Custom Search

Home