a value as far as four standard deviations from the mean is less than one in a
million. Thus, if the points in the cluster are really normally distributed, then
the probability that we will fail to include a point that truly belongs is less
than 10⁻⁶. And such a point is likely to be assigned to that cluster eventually
anyway, as long as it does not wind up closer to some other centroid as centroids
migrate in response to points added to their cluster.
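The inclusion test described above can be sketched in a few lines. This is a minimal illustration, assuming a diagonal covariance (independent dimensions) and the summary statistics N, SUM, and SUMSQ used by the BFR Algorithm; the function names are illustrative, not taken from the text:

```python
from math import sqrt

def normalized_distance(point, n, SUM, SUMSQ):
    """Distance from a point to a cluster, in units of standard
    deviations per dimension.  The cluster is summarized BFR-style:
    n = number of points, SUM[i] = sum of the i-th coordinates,
    SUMSQ[i] = sum of the squares of the i-th coordinates."""
    d = 0.0
    for i in range(len(point)):
        mean = SUM[i] / n                   # centroid coordinate
        var = SUMSQ[i] / n - mean ** 2      # variance in dimension i
        d += ((point[i] - mean) / sqrt(var)) ** 2
    return sqrt(d)

def belongs(point, n, SUM, SUMSQ, threshold=4.0):
    """Accept the point if it lies within `threshold` standard
    deviations of the centroid in the normalized sense above."""
    return normalized_distance(point, n, SUM, SUMSQ) < threshold
```

For example, the four points (0,0), (2,0), (0,2), (2,2) give n = 4, SUM = [4, 4], SUMSQ = [8, 8], so the centroid is (1,1) with variance 1 in each dimension; the point (1,1) is accepted, while a distant point such as (9,1) is not.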
7.3.6 Exercises for Section 7.3
Exercise 7.3.1 : For the points of Fig. 7.8, if we select three starting points
using the method of Section 7.3.2, and the first point we choose is (3, 4), which
other points are selected?
!! Exercise 7.3.2 : Prove that no matter what point we start with in Fig. 7.8, if
we select three starting points by the method of Section 7.3.2 we obtain points in
each of the three clusters. Hint: You could solve this exhaustively by beginning
with each of the twelve points in turn. However, a more generally applicable
solution is to consider the diameters of the three clusters and also consider
the minimum intercluster distance, that is, the minimum distance between two
points chosen from two different clusters. Can you prove a general theorem
based on these two parameters of a set of points?
! Exercise 7.3.3 : Give an example of a dataset and a selection of k initial
centroids such that when the points are reassigned to their nearest centroid at
the end, at least one of the initial k points is reassigned to a different cluster.
Exercise 7.3.4 : For the three clusters of Fig. 7.8:
(a) Compute the representation of the cluster as in the BFR Algorithm. That
is, compute N, SUM, and SUMSQ.
(b) Compute the variance and standard deviation of each cluster in each of
the two dimensions.
Exercise 7.3.5 : Suppose a cluster of three-dimensional points has standard
deviations of 2, 3, and 5, in the three dimensions, in that order. Compute the
Mahalanobis distance between the origin (0, 0, 0) and the point (1, −3, 4).
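As a reminder, in the simplified form used here (independent dimensions, so the covariance matrix is diagonal), the Mahalanobis distance of a point $p$ from a centroid $c$ with per-dimension standard deviations $\sigma_i$ is

$$d(p, c) \;=\; \sqrt{\sum_{i=1}^{d} \left( \frac{p_i - c_i}{\sigma_i} \right)^{2}}$$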
7.4 The CURE Algorithm
We now turn to another large-scale-clustering algorithm in the point-assignment
class. This algorithm, called CURE (Clustering Using REpresentatives), as-
sumes a Euclidean space. However, it does not assume anything about the
shape of clusters; they need not be normally distributed, and can even have
strange bends, S-shapes, or rings. Instead of representing clusters by their
centroid, it uses a collection of representative points, as the name implies.