Database Reference
In-Depth Information
7.5.4
Splitting and Merging Clusters
The GRGPF Algorithm assumes that there is a limit on the radius that a cluster may have.
The particular definition used for the radius is where c is the clustroid of the cluster
and N the number of points in the cluster. That is, the radius is the square root of the av-
erage square of the distance from the clustroid of the points in the cluster. If a cluster's
radius grows too large, it is split into two. The points of that cluster are brought into main
memory, and divided into two clusters to minimize the rowsums. The cluster features for
both clusters are computed.
As a result, the leaf of the split cluster now has one more cluster to represent. We should
manage the cluster tree like a B-tree, so usually, there will be room in a leaf to add one more
cluster. However, if not, then the leaf must be split into two leaves. To implement the split,
we must add another pointer and more sample clustroids at the parent node. Again, there
may be extra space, but if not, then this node too must be split, and we do so to minimize
the squares of the distances between the sample clustroids assigned to different nodes. As
in a B-tree, this splitting can ripple all the way up to the root, which can then be split if
needed.
The worst thing that can happen is that the cluster-representing tree is now too large to
fit in main memory. There is only one thing to do: we make it smaller by raising the limit
on how large the radius of a cluster can be, and we consider merging pairs of clusters. It is
normally sufficient to consider clusters that are “nearby,” in the sense that their represent-
atives are at the same leaf or at leaves with a common parent. However, in principle, we
can consider merging any two clusters C 1 and C 2 into one cluster C .
To merge clusters, we assume that the clustroid of C will be one of the points that are as
far as possible from the clustroid of C 1 or the clustroid of C 2 . Suppose we want to compute
the rowsum in C for the point p , which is one of the k points in C 1 that are as far as possible
from the centroid of C 1 . We use the curse-of-dimensionality argument that says all angles
are approximately right angles, to justify the following formula.
ROWSUM C ( p ) = ROWSUM C 1 ( p ) + N C 2 ( d 2 ( p, c 1 ) + d 2 ( c 1 , c 2 )) + ROWSUM C 2 ( c 2 )
In the above, we subscript N and ROWSUM by the cluster to which that feature refers. We
use c 1 and c 2 for the clustroids of C 1 and C 2 , respectively.
In detail, we compute the sum of the squares of the distances from p to all the nodes in
the combined cluster C by beginning with ROWSUM C 1 ( p ) to get the terms for the points
in the same cluster as p . For the N C 2 points q in C 2 , we consider the path from p to the clus-
troid of C 1 , then to the clustroid of C 2 , and finally to q . We assume there is a right angle
between the legs from p to c 1 and c 1 to c 2 , and another right angle between the shortest path
from from p to c 2 and the leg from c 2 to q . We then use the Pythagorean theorem to justify
Search WWH ::




Custom Search