Clustering - Mining of Massive Datasets

Database Reference

In-Depth Information

7.5.4

Splitting and Merging Clusters

The GRGPF Algorithm assumes that there is a limit on the radius that a cluster may have.

The particular definition used for the radius is where c is the clustroid of the cluster

and N the number of points in the cluster. That is, the radius is the square root of the av-

erage square of the distance from the clustroid of the points in the cluster. If a cluster's

radius grows too large, it is split into two. The points of that cluster are brought into main

memory, and divided into two clusters to minimize the rowsums. The cluster features for

both clusters are computed.

As a result, the leaf of the split cluster now has one more cluster to represent. We should

manage the cluster tree like a B-tree, so usually, there will be room in a leaf to add one more

cluster. However, if not, then the leaf must be split into two leaves. To implement the split,

we must add another pointer and more sample clustroids at the parent node. Again, there

may be extra space, but if not, then this node too must be split, and we do so to minimize

the squares of the distances between the sample clustroids assigned to different nodes. As

in a B-tree, this splitting can ripple all the way up to the root, which can then be split if

needed.

The worst thing that can happen is that the cluster-representing tree is now too large to

fit in main memory. There is only one thing to do: we make it smaller by raising the limit

on how large the radius of a cluster can be, and we consider merging pairs of clusters. It is

normally sufficient to consider clusters that are “nearby,” in the sense that their represent-

atives are at the same leaf or at leaves with a common parent. However, in principle, we

can consider merging any two clusters C 1 and C 2 into one cluster C .

To merge clusters, we assume that the clustroid of C will be one of the points that are as

far as possible from the clustroid of C 1 or the clustroid of C 2 . Suppose we want to compute

the rowsum in C for the point p , which is one of the k points in C 1 that are as far as possible

from the centroid of C 1 . We use the curse-of-dimensionality argument that says all angles

are approximately right angles, to justify the following formula.

ROWSUM C ( p ) = ROWSUM C 1 ( p ) + N C 2 ( d 2 ( p, c 1 ) + d 2 ( c 1 , c 2 )) + ROWSUM C 2 ( c 2 )

In the above, we subscript N and ROWSUM by the cluster to which that feature refers. We

use c 1 and c 2 for the clustroids of C 1 and C 2 , respectively.

In detail, we compute the sum of the squares of the distances from p to all the nodes in

the combined cluster C by beginning with ROWSUM C 1 ( p ) to get the terms for the points

in the same cluster as p . For the N C 2 points q in C 2 , we consider the path from p to the clus-

troid of C 1 , then to the clustroid of C 2 , and finally to q . We assume there is a right angle

between the legs from p to c 1 and c 1 to c 2 , and another right angle between the shortest path

from from p to c 2 and the leg from c 2 to q . We then use the Pythagorean theorem to justify

Search WWH ::

Custom Search

Home