Clustering - Mining of Massive Datasets

Database Reference

In-Depth Information

clustering. In particular, a Euclidean space's points are vectors of real numbers. The length

of the vector is the number of dimensions of the space. The components of the vector are

commonly called coordinates of the represented points.

All spaces for which we can perform a clustering have a distance measure, giving a dis-

tance between any two points in the space. We introduced distances in Section 3.5 . The

common Euclidean distance (square root of the sums of the squares of the differences

between the coordinates of the points in each dimension) serves for all Euclidean spaces,

although we also mentioned some other options for distance measures in Euclidean spaces,

including the Manhattan distance (sum of the magnitudes of the differences in each dimen-

sion) and the L ∞ -distance (maximum magnitude of the difference in any dimension).

EXAMPLE 7.1 Classical applications of clustering often involve low-dimensional Euclidean

spaces. For example, Fig. 7.1 shows height and weight measurements of dogs of several

varieties. Without knowing which dog is of which variety, we can see just by looking at

the diagram that the dogs fall into three clusters, and those clusters happen to correspond

to three varieties. With small amounts of data, any clustering algorithm will establish the

correct clusters, and simply plotting the points and “eyeballing” the plot will suffice as

well.

□

Figure 7.1 Heights and weights of dogs taken from three varieties

However, modern clustering problems are not so simple. They may involve Euclidean

spaces of very high dimension or spaces that are not Euclidean at all. For example, it is

challenging to cluster documents by their topic, based on the occurrence of common, un-

usual words in the documents. It is challenging to cluster moviegoers by the type or types

of movies they like.

We also considered in Section 3.5 distance measures for non-Euclidean spaces. These

include the Jaccard distance, cosine distance, Hamming distance, and edit distance. Recall

that the requirements for a function on pairs of points to be a distance measure are that

(1) Distances are always nonnegative, and only the distance between a point and itself is

0.

(2) Distance is symmetric; it doesn't matter in which order you consider the points when

computing their distance.

Search WWH ::

Custom Search

Home