Databases Reference
In-Depth Information
which has a number of important properties useful for clustering. In particular,
a Euclidean space's points are vectors of real numbers. The length of the vector
is the number of dimensions of the space. The components of the vector are
commonly called coordinates of the represented points.
All spaces for which we can perform a clustering have a distance measure,
giving a distance between any two points in the space. We introduced distances
in Section 3.5. The common Euclidean distance (square root of the sums of the
squares of the differences between the coordinates of the points in each dimen-
sion) serves for all Euclidean spaces, although we also mentioned some other
options for distance measures in Euclidean spaces, including the Manhattan
distance (sum of the magnitudes of the differences in each dimension) and the
L -distance (maximum magnitude of the difference in any dimension).
Beagles
Height
Chihuahuas
Dachshunds
Weight
Figure 7.1: Heights and weights of dogs taken from three varieties
Example 7.1 : Classical applications of clustering often involve low-dimen-
sional Euclidean spaces. For example, Fig. 7.1 shows height and weight mea-
surements of dogs of several varieties. Without knowing which dog is of which
variety, we can see just by looking at the diagram that the dogs fall into three
clusters, and those clusters happen to correspond to three varieties. With small
amounts of data, any clustering algorithm will establish the correct clusters, and
simply plotting the points and “eyeballing” the plot will su ce as well.
2
However, modern clustering problems are not so simple. They may involve
Euclidean spaces of very high dimension or spaces that are not Euclidean at
all. For example, it is challenging to cluster documents by their topic, based on
the occurrence of common, unusual words in the documents. It is challenging
to cluster moviegoers by the type or types of movies they like.
We also considered in Section 3.5 distance measures for non-Euclidean spa-
ces.
These include the Jaccard distance, cosine distance, Hamming distance,
Search WWH ::




Custom Search