a value as far as four standard deviations from the mean is less than one in a
million. Thus, if the points in the cluster are really normally distributed, then
the probability that we will fail to include a point that truly belongs is less
than 10⁻⁶. And such a point is likely to be assigned to that cluster eventually
anyway, as long as it does not wind up closer to some other centroid as centroids
migrate in response to points added to their cluster.
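The inclusion test described above can be sketched in a few lines. This is a minimal illustration, assuming a diagonal covariance (independent dimensions) and the summary statistics N, SUM, and SUMSQ used by the BFR Algorithm; the function names are illustrative, not taken from the text:

```python
from math import sqrt

def normalized_distance(point, n, SUM, SUMSQ):
    """Distance from a point to a cluster, in units of standard
    deviations per dimension.  The cluster is summarized BFR-style:
    n = number of points, SUM[i] = sum of the i-th coordinates,
    SUMSQ[i] = sum of the squares of the i-th coordinates."""
    d = 0.0
    for i in range(len(point)):
        mean = SUM[i] / n                   # centroid coordinate
        var = SUMSQ[i] / n - mean ** 2      # variance in dimension i
        d += ((point[i] - mean) / sqrt(var)) ** 2
    return sqrt(d)

def belongs(point, n, SUM, SUMSQ, threshold=4.0):
    """Accept the point if it lies within `threshold` standard
    deviations of the centroid in the normalized sense above."""
    return normalized_distance(point, n, SUM, SUMSQ) < threshold
```

For example, the four points (0,0), (2,0), (0,2), (2,2) give n = 4, SUM = [4, 4], SUMSQ = [8, 8], so the centroid is (1,1) with variance 1 in each dimension; the point (1,1) is accepted, while a distant point such as (9,1) is not.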
7.3.6 Exercises for Section 7.3
Exercise 7.3.1 : For the points of Fig. 7.8, if we select three starting points
using the method of Section 7.3.2, and the first point we choose is (3, 4), which
other points are selected?
!! Exercise 7.3.2 : Prove that no matter what point we start with in Fig. 7.8, if
we select three starting points by the method of Section 7.3.2 we obtain points in
each of the three clusters. Hint: You could solve this exhaustively by beginning
with each of the twelve points in turn. However, a more generally applicable
solution is to consider the diameters of the three clusters and also consider
the minimum intercluster distance, that is, the minimum distance between two
points chosen from two different clusters. Can you prove a general theorem
based on these two parameters of a set of points?
! Exercise 7.3.3 : Give an example of a dataset and a selection of k initial
centroids such that when the points are reassigned to their nearest centroid at
the end, at least one of the initial k points is reassigned to a different cluster.
Exercise 7.3.4 : For the three clusters of Fig. 7.8:
(a) Compute the representation of the cluster as in the BFR Algorithm. That
is, compute N, SUM, and SUMSQ.
(b) Compute the variance and standard deviation of each cluster in each of
the two dimensions.
Exercise 7.3.5 : Suppose a cluster of three-dimensional points has standard
deviations of 2, 3, and 5, in the three dimensions, in that order. Compute the
Mahalanobis distance between the origin (0, 0, 0) and the point (1, −3, 4).
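As a reminder, in the simplified form used here (independent dimensions, so the covariance matrix is diagonal), the Mahalanobis distance of a point $p$ from a centroid $c$ with per-dimension standard deviations $\sigma_i$ is

$$d(p, c) \;=\; \sqrt{\sum_{i=1}^{d} \left( \frac{p_i - c_i}{\sigma_i} \right)^{2}}$$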
7.4 The CURE Algorithm
We now turn to another large-scale-clustering algorithm in the point-assignment
class. This algorithm, called CURE (Clustering Using REpresentatives), as-
sumes a Euclidean space. However, it does not assume anything about the
shape of clusters; they need not be normally distributed, and can even have
strange bends, S-shapes, or rings. Instead of representing clusters by their
centroid, it uses a collection of representative points, as the name implies.