7.5.2 Initializing the Cluster Tree
The clusters are organized into a tree, and the nodes of the tree may be very large, perhaps
disk blocks or pages, as would be the case for a B-tree or R-tree, which the cluster-representing tree resembles. Each leaf of the tree holds as many cluster representations as can fit.
Note that a cluster representation has a size that does not depend on the number of points
in the cluster.
An interior node of the cluster tree holds a sample of the clustroids of the clusters represented by each of its subtrees, along with pointers to the roots of those subtrees. The samples
are of fixed size, so the number of children that an interior node may have is independent
of its level. Notice that as we go up the tree, the probability that a given cluster's clustroid
is part of the sample diminishes.
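As a concrete illustration, the structures below sketch one way the cluster-representing tree might be laid out in code. The Python dataclasses and field names (clustroid, rowsum, and so on) are assumptions for illustration, not part of the algorithm's specification; the essential property is that each cluster representation and each interior-node sample has a fixed size.

from dataclasses import dataclass, field

@dataclass
class ClusterRep:
    """Fixed-size summary of one cluster; its size does not grow with N."""
    n: int            # N, the number of points in the cluster
    clustroid: tuple  # the point of the cluster chosen as its clustroid
    rowsum: float     # sum of squared distances from the clustroid to the cluster's points

@dataclass
class LeafNode:
    """A leaf holds as many fixed-size cluster representations as fit in one block."""
    reps: list = field(default_factory=list)      # list of ClusterRep

@dataclass
class InteriorNode:
    """An interior node keeps, for each child, a fixed-size sample of the clustroids
    of the clusters in that child's subtree, plus a pointer to the child itself."""
    samples: list = field(default_factory=list)   # samples[i] = sampled clustroids for children[i]
    children: list = field(default_factory=list)  # children[i] = LeafNode or InteriorNode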
We initialize the cluster tree by taking a main-memory sample of the dataset and clustering it hierarchically. The result of this clustering is a tree T, but T is not exactly the tree
used by the GRGPF Algorithm. Rather, we select from T certain of its nodes that represent
clusters of approximately some desired size n. These are the initial clusters for the GRGPF
Algorithm, and we place their representations at the leaves of the cluster-representing tree.
We then group clusters with a common ancestor in T into interior nodes of the cluster-representing tree, so in some sense, clusters descended from one interior node are as close as
possible. In some cases, rebalancing of the cluster-representing tree will be necessary. This
process is similar to the reorganization of a B-tree, and we shall not examine this issue in
detail.
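The sketch below suggests one way the initial clusters could be chosen from T and packed into leaves. The TreeNode class, the size test, and pack_into_leaves are illustrative assumptions; the text only requires clusters of roughly size n, grouped by common ancestor in T.

from dataclasses import dataclass, field

@dataclass
class TreeNode:
    """A node of the main-memory hierarchical clustering T."""
    points: list = field(default_factory=list)    # all sampled points under this node
    children: list = field(default_factory=list)  # subtrees of T

def select_initial_clusters(node, n):
    """Cut T at the highest nodes holding roughly n points or fewer;
    each selected node becomes one initial cluster."""
    if len(node.points) <= n or not node.children:
        return [node]
    clusters = []
    for child in node.children:
        clusters.extend(select_initial_clusters(child, n))
    return clusters

def pack_into_leaves(clusters, leaf_capacity):
    """Consecutive clusters in the depth-first order above share low ancestors
    in T, so packing them into the same leaf keeps close clusters together
    under one interior node of the cluster-representing tree."""
    return [clusters[i:i + leaf_capacity]
            for i in range(0, len(clusters), leaf_capacity)]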
7.5.3 Adding Points in the GRGPF Algorithm
We now read points from secondary storage and insert each one into the nearest cluster. We
start at the root, and look at the samples of clustroids for each of the children of the root.
Whichever child has the clustroid closest to the new point p is the node we examine next.
When we reach any node in the tree, we look at the sample clustroids for its children and
go next to the child with the clustroid closest to p. Note that some of the sample clustroids
at a node may have been seen at a higher level, but each level provides more detail about
the clusters lying below, so we see many new sample clustroids each time we go a level
down the tree.
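A minimal sketch of this descent follows, assuming the InteriorNode/LeafNode layout sketched in Section 7.5.2 and Euclidean distance; any distance measure would serve. It also previews the leaf-level choice of cluster described in the next paragraph.

import math

def nearest_leaf(root, p):
    """Descend from the root, always moving to the child one of whose sampled
    clustroids is closest to p, until a leaf is reached."""
    node = root
    while getattr(node, "children", None):        # stop once we reach a leaf
        best_child, best_dist = None, float("inf")
        for child, sample in zip(node.children, node.samples):
            for clustroid in sample:
                d = math.dist(p, clustroid)       # Euclidean distance, for illustration
                if d < best_dist:
                    best_dist, best_child = d, child
        node = best_child
    return node

def nearest_cluster(leaf, p):
    """At the leaf, pick the cluster whose clustroid is closest to p."""
    return min(leaf.reps, key=lambda rep: math.dist(p, rep.clustroid))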
Finally, we reach a leaf. This leaf has the cluster features for each cluster represented by
that leaf, and we pick the cluster whose clustroid is closest to p. We adjust the representation
of this cluster to account for the new point p. In particular, we:
(1) Add 1 to N.