Begin by creating many Map tasks. Each task is assigned a subset of the points. The Map
function's job is to cluster the points it is given. Its output is a set of key-value pairs with
a fixed key 1, and a value that is the description of one cluster. This description can be any
of the possibilities suggested in Section 7.6.2, such as the centroid, count, and diameter of
the cluster.
Since all key-value pairs have the same key, there can be only one Reduce task. This task
gets descriptions of the clusters produced by each of the Map tasks, and must merge them
appropriately. We may use the discussion in Section 7.6.4 as representative of the various
strategies we might use to produce the final clustering, which is the output of the Reduce
task.
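The Map/Reduce division of labor described above can be sketched as follows. This is an illustrative sketch, not the book's exact algorithm: each Map task runs a small k-means on its own subset of 1-dimensional points and emits pairs (1, (centroid, count)); the single Reduce task receives all the pairs and merges cluster descriptions whose centroids lie within a chosen threshold, using a count-weighted average so the merged centroid is consistent. The threshold `merge_dist` and the naive initialization are assumptions made for the sketch.

```python
def map_task(points, k=2, iters=10):
    """Run a tiny k-means on this task's points; emit (1, (centroid, count)) pairs."""
    centroids = sorted(points)[:k]          # naive initialization (assumption)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: abs(p - centroids[i]))
            groups[i].append(p)
        centroids = [sum(g) / len(g) if g else centroids[i]
                     for i, g in enumerate(groups)]
    # All pairs share the fixed key 1, so one Reduce task sees everything.
    return [(1, (c, len(g))) for c, g in zip(centroids, groups) if g]

def reduce_task(pairs, merge_dist=10.0):
    """Merge cluster descriptions whose centroids lie within merge_dist."""
    merged = []                             # list of (centroid, count)
    for _, (c, n) in sorted(pairs, key=lambda kv: kv[1][0]):
        if merged and abs(c - merged[-1][0]) < merge_dist:
            c0, n0 = merged[-1]
            # Count-weighted average keeps the merged centroid exact.
            merged[-1] = ((c0 * n0 + c * n) / (n0 + n), n0 + n)
        else:
            merged.append((c, n))
    return merged
```

For example, feeding `reduce_task` the pairs produced by two Map tasks over `[1, 2, 3, 50, 51]` and `[4, 5, 49, 52]` yields two merged clusters whose counts sum to 9.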
7.6.7
Exercises for Section 7.6
EXERCISE 7.6.1 Execute the BDMO Algorithm with p = 3 on the following 1-dimensional,
Euclidean data:
1, 45, 80, 24, 56, 71, 17, 40, 66, 32, 48, 96, 9, 41, 75, 11, 58, 93, 28, 39, 77
The clustering algorithm is k-means with k = 3. Only the centroid of a cluster, along with
its count, is needed to represent a cluster.
EXERCISE 7.6.2 Using your clusters from Exercise 7.6.1, produce the best centroids in response to a query asking for a clustering of the last 10 points.
7.7 Summary of Chapter 7
Clustering : Clusters are often a useful summary of data that is in the form of points in some space. To cluster points,
we need a distance measure on that space. Ideally, points in the same cluster have small distances between them,
while points in different clusters have large distances between them.
Clustering Algorithms : Clustering algorithms generally have one of two forms. Hierarchical clustering algorithms
begin with all points in a cluster of their own, and nearby clusters are merged iteratively. Point-assignment clustering
algorithms consider points in turn and assign them to the cluster in which they best fit.
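The hierarchical idea in the summary above can be sketched in a few lines for one dimension: start with each point as its own cluster and repeatedly merge the two clusters whose centroids are closest, stopping when k clusters remain. (Point-assignment methods such as k-means instead fix k up front and assign each point to its best cluster.) This is a toy sketch, not an efficient implementation; real hierarchical algorithms use a priority queue to find the closest pair.

```python
def hierarchical(points, k):
    """Merge the two closest clusters (by centroid distance) until k remain."""
    clusters = [[p] for p in points]        # every point starts in its own cluster
    while len(clusters) > k:
        cent = [sum(c) / len(c) for c in clusters]
        # Find the pair of clusters with the closest centroids.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: abs(cent[ij[0]] - cent[ij[1]]))
        clusters[i] += clusters.pop(j)      # merge cluster j into cluster i
    return clusters
```

For instance, `hierarchical([1, 2, 10, 11, 20], 3)` groups the nearby points into `[1, 2]`, `[10, 11]`, and `[20]`.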
The Curse of Dimensionality : Points in high-dimensional Euclidean spaces, as well as points in non-Euclidean
spaces, often behave unintuitively. Two unexpected properties of these spaces are that random points are almost
always at about the same distance from one another, and random vectors are almost always nearly orthogonal.
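Both properties are easy to check numerically. In the illustrative sketch below (not from the book), with d = 1000, distances between random points in the unit cube concentrate tightly around sqrt(d/6), and two random Gaussian vectors have cosine close to 0, i.e. they are nearly orthogonal.

```python
import math
import random

random.seed(0)
d = 1000                                    # a high dimension

# 20 random points in the d-dimensional unit cube.
points = [[random.random() for _ in range(d)] for _ in range(20)]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

dists = [dist(points[i], points[j])
         for i in range(20) for j in range(i + 1, 20)]
print(min(dists), max(dists))               # all pairs at nearly the same distance

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Two random direction vectors (Gaussian components, symmetric about 0).
u = [random.gauss(0, 1) for _ in range(d)]
v = [random.gauss(0, 1) for _ in range(d)]
print(cosine(u, v))                         # close to 0: nearly orthogonal
```

Since each coordinate contributes an expected squared difference of 1/6, the distances cluster near sqrt(1000/6) ≈ 12.9, with very little spread.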
Centroids and Clustroids : In a Euclidean space, the members of a cluster can be averaged, and this average is called
the centroid. In non-Euclidean spaces, there is no guarantee that points have an “average,” so we are forced to use
one of the members of the cluster as a representative or typical element of the cluster. That representative is called
the clustroid.
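A minimal sketch of the two representatives follows (illustrative only, and assuming one common clustroid criterion among several): the centroid is the coordinate-wise average, which requires a Euclidean space, while the clustroid is the actual cluster member minimizing the sum of squared distances to the other members, which needs only a distance function.

```python
import math

def centroid(points):
    """Coordinate-wise average of the points (Euclidean spaces only)."""
    d = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(d))

def clustroid(points, dist):
    """The member minimizing the sum of squared distances to the others."""
    return min(points, key=lambda p: sum(dist(p, q) ** 2 for q in points))
```

Note that the centroid is usually not one of the points, while the clustroid always is: for `[(0, 0), (2, 0), (0, 2), (10, 10)]` the centroid is `(3.0, 3.0)`, a location where no point lies.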