Finding Similar Items - Mining of Massive Datasets

Databases Reference

In-Depth Information

Non-Euclidean Spaces

Notice that several of the distance measures introduced in this section are

not Euclidean spaces. A property of Euclidean spaces that we shall find

important when we take up clustering in Chapter 7 is that the average

of points in a Euclidean space always exists and is a point in the space.

However, consider the space of sets for which we defined the Jaccard dis-

tance. The notion of the “average” of two sets makes no sense. Likewise,

the space of strings, where we can use the edit distance, does not let us

take the “average” of strings.

Vector spaces, for which we suggested the cosine distance, may or may

not be Euclidean. If the components of the vectors can be any real num-

bers, then the space is Euclidean. However, if we restrict components to

be integers, then the space is not Euclidean. Notice that, for instance, we

cannot find an average of the vectors [1, 2] and [3, 1] in the space of vectors

with two integer components, although if we treated them as members of

the two-dimensional Euclidean space, then we could say that their average

was [2.0, 1.5].

Edit distance is a distance measure. Surely no edit distance can be negative,

and only two identical strings have an edit distance of 0. To see that edit

distance is symmetric, note that a sequence of insertions and deletions can be

reversed, with each insertion becoming a deletion, and vice-versa. The triangle

inequality is also straightforward. One way to turn a string s into a string t

is to turn s into some string u and then turn u into t. Thus, the number of

edits made going from s to u, plus the number of edits made going from u to t

cannot be less than the smallest number of edits that will turn s into t.

3.5.6

Hamming Distance

Given a space of vectors, we define the Hamming distance between two vectors

to be the number of components in which they differ. It should be obvious that

Hamming distance is a distance measure. Clearly the Hamming distance cannot

be negative, and if it is zero, then the vectors are identical. The distance does

not depend on which of two vectors we consider first. The triangle inequality

should also be evident. If x and z differ in m components, and z and y differ in

n components, then x and y cannot differ in more than m+n components. Most

commonly, Hamming distance is used when the vectors are boolean; they consist

of 0's and 1's only. However, in principle, the vectors can have components from

any set.

Example 3.16 : The Hamming distance between the vectors 10101 and 11110

is 3. That is, these vectors differ in the second, fourth, and fifth components,

Search WWH ::

Custom Search

Home