Databases Reference
In-Depth Information
Non-Euclidean Spaces
Notice that several of the distance measures introduced in this section are
not Euclidean spaces. A property of Euclidean spaces that we shall find
important when we take up clustering in Chapter 7 is that the average
of points in a Euclidean space always exists and is a point in the space.
However, consider the space of sets for which we defined the Jaccard dis-
tance. The notion of the “average” of two sets makes no sense. Likewise,
the space of strings, where we can use the edit distance, does not let us
take the “average” of strings.
Vector spaces, for which we suggested the cosine distance, may or may
not be Euclidean. If the components of the vectors can be any real num-
bers, then the space is Euclidean. However, if we restrict components to
be integers, then the space is not Euclidean. Notice that, for instance, we
cannot find an average of the vectors [1, 2] and [3, 1] in the space of vectors
with two integer components, although if we treated them as members of
the two-dimensional Euclidean space, then we could say that their average
was [2.0, 1.5].
Edit distance is a distance measure. Surely no edit distance can be negative,
and only two identical strings have an edit distance of 0. To see that edit
distance is symmetric, note that a sequence of insertions and deletions can be
reversed, with each insertion becoming a deletion, and vice-versa. The triangle
inequality is also straightforward. One way to turn a string s into a string t
is to turn s into some string u and then turn u into t. Thus, the number of
edits made going from s to u, plus the number of edits made going from u to t
cannot be less than the smallest number of edits that will turn s into t.
3.5.6
Hamming Distance
Given a space of vectors, we define the Hamming distance between two vectors
to be the number of components in which they differ. It should be obvious that
Hamming distance is a distance measure. Clearly the Hamming distance cannot
be negative, and if it is zero, then the vectors are identical. The distance does
not depend on which of two vectors we consider first. The triangle inequality
should also be evident. If x and z differ in m components, and z and y differ in
n components, then x and y cannot differ in more than m+n components. Most
commonly, Hamming distance is used when the vectors are boolean; they consist
of 0's and 1's only. However, in principle, the vectors can have components from
any set.
Example 3.16 : The Hamming distance between the vectors 10101 and 11110
is 3. That is, these vectors differ in the second, fourth, and fifth components,
Search WWH ::




Custom Search