Finding Similar Items - Mining of Massive Datasets

Databases Reference

In-Depth Information

3.5

Distance Measures

We now take a short detour to study the general notion of distance measures.

The Jaccard similarity is a measure of how close sets are, although it is not

really a distance measure. That is, the closer sets are, the higher the Jaccard

similarity. Rather, 1 minus the Jaccard similarity is a distance measure, as we

shall see; it is called the Jaccard distance.

However, Jaccard distance is not the only measure of closeness that makes

sense. We shall examine in this section some other distance measures that have

applications. Then, in Section 3.6 we see how some of these distance measures

also have an LSH technique that allows us to focus on nearby points without

comparing all points. Other applications of distance measures will appear when

we study clustering in Chapter 7.

3.5.1 Definition of a Distance Measure

Suppose we have a set of points, called a space. A distance measure on this

space is a function d(x, y) that takes two points in the space as arguments and

produces a real number, and satisfies the following axioms:

1. d(x, y)≥0 (no negative distances).

2. d(x, y) = 0 if and only if x = y (distances are positive, except for the

distance from a point to itself).

3. d(x, y) = d(y, x) (distance is symmetric).

4. d(x, y)≤d(x, z) + d(z, y) (the triangle inequality).

The triangle inequality is the most complex condition. It says, intuitively, that

to travel from x to y, we cannot obtain any benefit if we are forced to travel via

some particular third point z. The triangle-inequality axiom is what makes all

distance measures behave as if distance describes the length of a shortest path

from one point to another.

3.5.2 Euclidean Distances

The most familiar distance measure is the one we normally think of as “dis-

tance.” An n-dimensional Euclidean space is one where points are vectors of n

real numbers. The conventional distance measure in this space, which we shall

refer to as the L 2 -norm, is defined:

n

d([x 1 , x 2 , . . . , x n ], [y 1 , y 2 , . . . , y n ]) =

(x i

−y i ) 2

i=1

That is, we square the distance in each dimension, sum the squares, and take

the positive square root.

Search WWH ::

Custom Search

Home