Databases Reference
In-Depth Information
the other points; that is, the clustroid is the point in the cluster with the
smallest ROWSUM .
3. The rowsum of the clustroid of the cluster.
4. For some chosen constant k, the k points of the cluster that are closest to
the clustroid, and their rowsums. These points are part of the represen-
tation in case the addition of points to the cluster causes the clustroid to
change. The assumption is made that the new clustroid would be one of
these k points near the old clustroid.
5. The k points of the cluster that are furthest from the clustroid and their
rowsums. These points are part of the representation so that we can
consider whether two clusters are close enough to merge. The assumption
is made that if two clusters are close, then a pair of points distant from
their respective clustroids would be close.
These features form the representation of a cluster.
7.5.2
Initializing the Cluster Tree
The clusters are organized into a tree, and the nodes of the tree may be very
large, perhaps disk blocks or pages, as would be the case for a B-tree or R-tree,
which the cluster-representing tree resembles. Each leaf of the tree holds as
many cluster representations as can fit. Note that a cluster representation has
a size that does not depend on the number of points in the cluster.
An interior node of the cluster tree holds a sample of the clustroids of the
clusters represented by each of its subtrees, along with pointers to the roots of
those subtrees. The samples are of fixed size, so the number of children that
an interior node may have is independent of its level. Notice that as we go up
the tree, the probability that a given cluster's clustroid is part of the sample
diminishes.
We initialize the cluster tree by taking a main-memory sample of the dataset
and clustering it hierarchically. The result of this clustering is a tree T , but
T is not exactly the tree used by the GRGPF Algorithm. Rather, we select
from T certain of its nodes that represent clusters of approximately some de-
sired size n. These are the initial clusters for the GRGPF Algorithm, and we
place their representations at the leaf of the cluster-representing tree. We then
group clusters with a common ancestor in T into interior nodes of the cluster-
representing tree, so in some sense, clusters descended from one interior node
are close as possible. In some cases, rebalancing of the cluster-representing tree
will be necessary. This process is similar to the reorganization of a B-tree, and
we shall not examine this issue in detail.
Search WWH ::




Custom Search