Database Reference
In-Depth Information
7.5 Clustering in Non-Euclidean Spaces
We shall next consider an algorithm that handles non-main-memory data, but does not re-
quire a Euclidean space. The algorithm, which we shall refer to as GRGPF for its authors
(V. Ganti, R. Ramakrishnan, J. Gehrke, A. Powell, and J. French), takes ideas from both
hierarchical and point-assignment approaches. Like CURE, it represents clusters by sample
points in main memory. However, it also tries to organize the clusters hierarchically, in a
tree, so a new point can be assigned to the appropriate cluster by passing it down the tree.
Leaves of the tree hold summaries of some clusters, and interior nodes hold subsets of
the information describing the clusters reachable through that node. An attempt is made to
group clusters by their distance from one another, so the clusters at a leaf are close, and the
clusters reachable from one interior node are relatively close as well.
7.5.1
Representing Clusters in the GRGPF Algorithm
As we assign points to clusters, the clusters can grow large. Most of the points in a cluster
are stored on disk, and are not used in guiding the assignment of points, although they can
be retrieved. The representation of a cluster in main memory consists of several features .
Before listing these features, if p is any point in a cluster, let ROWSUM( p ) be the sum of
the squares of the distances from p to each of the other points in the cluster. Note that, al-
though we are not in a Euclidean space, there is some distance measure d that applies to
points, or else it is not possible to cluster points at all. The following features form the rep-
resentation of a cluster.
(1) N , the number of points in the cluster.
(2) The clustroid of the cluster, which is defined specifically to be the point in the cluster
that minimizes the sum of the squares of the distances to the other points; that is, the
clustroid is the point in the cluster with the smallest ROWSUM.
(3) The rowsum of the clustroid of the cluster.
(4) For some chosen constant k , the k points of the cluster that are closest to the clustroid,
and their rowsums. These points are part of the representation in case the addition of
points to the cluster causes the clustroid to change. The assumption is made that the
new clustroid would be one of these k points near the old clustroid.
(5) The k points of the cluster that are furthest from the clustroid and their rowsums. These
points are part of the representation so that we can consider whether two clusters are
close enough to merge. The assumption is made that if two clusters are close, then a
pair of points distant from their respective clustroids would be close.
Search WWH ::




Custom Search