Begin by creating many Map tasks. Each task is assigned a subset of the points. The Map
function's job is to cluster the points it is given. Its output is a set of key-value pairs with
a fixed key 1, and a value that is the description of one cluster. This description can be any
of the possibilities suggested in Section 7.6.2, such as the centroid, count, and diameter of
the cluster.
Since all key-value pairs have the same key, there can be only one Reduce task. This task
gets descriptions of the clusters produced by each of the Map tasks, and must merge them
appropriately. We may use the discussion in Section 7.6.4 as representative of the various
strategies we might use to produce the final clustering, which is the output of the Reduce
task.
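The Map/Reduce division of labor described above can be sketched as follows. This is an illustrative sketch, not the book's exact algorithm: each Map task runs a small k-means on its own subset of 1-dimensional points and emits pairs (1, (centroid, count)); the single Reduce task receives all the pairs and merges cluster descriptions whose centroids lie within a chosen threshold, using a count-weighted average so the merged centroid is consistent. The threshold `merge_dist` and the naive initialization are assumptions made for the sketch.

```python
def map_task(points, k=2, iters=10):
    """Run a tiny k-means on this task's points; emit (1, (centroid, count)) pairs."""
    centroids = sorted(points)[:k]          # naive initialization (assumption)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: abs(p - centroids[i]))
            groups[i].append(p)
        centroids = [sum(g) / len(g) if g else centroids[i]
                     for i, g in enumerate(groups)]
    # All pairs share the fixed key 1, so one Reduce task sees everything.
    return [(1, (c, len(g))) for c, g in zip(centroids, groups) if g]

def reduce_task(pairs, merge_dist=10.0):
    """Merge cluster descriptions whose centroids lie within merge_dist."""
    merged = []                             # list of (centroid, count)
    for _, (c, n) in sorted(pairs, key=lambda kv: kv[1][0]):
        if merged and abs(c - merged[-1][0]) < merge_dist:
            c0, n0 = merged[-1]
            # Count-weighted average keeps the merged centroid exact.
            merged[-1] = ((c0 * n0 + c * n) / (n0 + n), n0 + n)
        else:
            merged.append((c, n))
    return merged
```

For example, feeding `reduce_task` the pairs produced by two Map tasks over `[1, 2, 3, 50, 51]` and `[4, 5, 49, 52]` yields two merged clusters whose counts sum to 9.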
7.6.7
Exercises for Section 7.6
EXERCISE 7.6.1 Execute the BDMO Algorithm with p = 3 on the following 1-dimensional,
Euclidean data:
1, 45, 80, 24, 56, 71, 17, 40, 66, 32, 48, 96, 9, 41, 75, 11, 58, 93, 28, 39, 77
The clustering algorithm is k-means with k = 3. Only the centroid of a cluster, along with
its count, is needed to represent a cluster.
EXERCISE 7.6.2 Using your clusters from Exercise 7.6.1, produce the best centroids in response to a query asking for a clustering of the last 10 points.
7.7 Summary of Chapter 7
Clustering : Clusters are often a useful summary of data that is in the form of points in some space. To cluster points,
we need a distance measure on that space. Ideally, points in the same cluster have small distances between them,
while points in different clusters have large distances between them.
Clustering Algorithms : Clustering algorithms generally have one of two forms. Hierarchical clustering algorithms
begin with all points in a cluster of their own, and nearby clusters are merged iteratively. Point-assignment clustering
algorithms consider points in turn and assign them to the cluster in which they best fit.
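The hierarchical idea in the summary above can be sketched in a few lines for one dimension: start with each point as its own cluster and repeatedly merge the two clusters whose centroids are closest, stopping when k clusters remain. (Point-assignment methods such as k-means instead fix k up front and assign each point to its best cluster.) This is a toy sketch, not an efficient implementation; real hierarchical algorithms use a priority queue to find the closest pair.

```python
def hierarchical(points, k):
    """Merge the two closest clusters (by centroid distance) until k remain."""
    clusters = [[p] for p in points]        # every point starts in its own cluster
    while len(clusters) > k:
        cent = [sum(c) / len(c) for c in clusters]
        # Find the pair of clusters with the closest centroids.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: abs(cent[ij[0]] - cent[ij[1]]))
        clusters[i] += clusters.pop(j)      # merge cluster j into cluster i
    return clusters
```

For instance, `hierarchical([1, 2, 10, 11, 20], 3)` groups the nearby points into `[1, 2]`, `[10, 11]`, and `[20]`.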
The Curse of Dimensionality : Points in high-dimensional Euclidean spaces, as well as points in non-Euclidean
spaces, often behave unintuitively. Two unexpected properties of these spaces are that random points are almost
always at about the same distance from one another, and random vectors are almost always nearly orthogonal.
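Both properties are easy to check numerically. In the illustrative sketch below (not from the book), with d = 1000, distances between random points in the unit cube concentrate tightly around sqrt(d/6), and two random Gaussian vectors have cosine close to 0, i.e. they are nearly orthogonal.

```python
import math
import random

random.seed(0)
d = 1000                                    # a high dimension

# 20 random points in the d-dimensional unit cube.
points = [[random.random() for _ in range(d)] for _ in range(20)]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

dists = [dist(points[i], points[j])
         for i in range(20) for j in range(i + 1, 20)]
print(min(dists), max(dists))               # all pairs at nearly the same distance

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Two random direction vectors (Gaussian components, symmetric about 0).
u = [random.gauss(0, 1) for _ in range(d)]
v = [random.gauss(0, 1) for _ in range(d)]
print(cosine(u, v))                         # close to 0: nearly orthogonal
```

Since each coordinate contributes an expected squared difference of 1/6, the distances cluster near sqrt(1000/6) ≈ 12.9, with very little spread.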
Centroids and Clustroids : In a Euclidean space, the members of a cluster can be averaged, and this average is called
the centroid. In non-Euclidean spaces, there is no guarantee that points have an “average,” so we are forced to use
one of the members of the cluster as a representative or typical element of the cluster. That representative is called
the clustroid.
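A minimal sketch of the two representatives follows (illustrative only, and assuming one common clustroid criterion among several): the centroid is the coordinate-wise average, which requires a Euclidean space, while the clustroid is the actual cluster member minimizing the sum of squared distances to the other members, which needs only a distance function.

```python
import math

def centroid(points):
    """Coordinate-wise average of the points (Euclidean spaces only)."""
    d = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(d))

def clustroid(points, dist):
    """The member minimizing the sum of squared distances to the others."""
    return min(points, key=lambda p: sum(dist(p, q) ** 2 for q in points))
```

Note that the centroid is usually not one of the points, while the clustroid always is: for `[(0, 0), (2, 0), (0, 2), (10, 10)]` the centroid is `(3.0, 3.0)`, a location where no point lies.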