Database Reference
In-Depth Information
clustering. In particular, a Euclidean space's points are vectors of real numbers. The length
of the vector is the number of dimensions of the space. The components of the vector are
commonly called coordinates of the represented points.
All spaces for which we can perform a clustering have a distance measure, giving a dis-
tance between any two points in the space. We introduced distances in Section 3.5 . The
common Euclidean distance (square root of the sums of the squares of the differences
between the coordinates of the points in each dimension) serves for all Euclidean spaces,
although we also mentioned some other options for distance measures in Euclidean spaces,
including the Manhattan distance (sum of the magnitudes of the differences in each dimen-
sion) and the L -distance (maximum magnitude of the difference in any dimension).
EXAMPLE 7.1 Classical applications of clustering often involve low-dimensional Euclidean
spaces. For example, Fig. 7.1 shows height and weight measurements of dogs of several
varieties. Without knowing which dog is of which variety, we can see just by looking at
the diagram that the dogs fall into three clusters, and those clusters happen to correspond
to three varieties. With small amounts of data, any clustering algorithm will establish the
correct clusters, and simply plotting the points and “eyeballing” the plot will suffice as
well.
Figure 7.1 Heights and weights of dogs taken from three varieties
However, modern clustering problems are not so simple. They may involve Euclidean
spaces of very high dimension or spaces that are not Euclidean at all. For example, it is
challenging to cluster documents by their topic, based on the occurrence of common, un-
usual words in the documents. It is challenging to cluster moviegoers by the type or types
of movies they like.
We also considered in Section 3.5 distance measures for non-Euclidean spaces. These
include the Jaccard distance, cosine distance, Hamming distance, and edit distance. Recall
that the requirements for a function on pairs of points to be a distance measure are that
(1) Distances are always nonnegative, and only the distance between a point and itself is
0.
(2) Distance is symmetric; it doesn't matter in which order you consider the points when
computing their distance.
Search WWH ::




Custom Search