7.5.2 Initializing the Cluster Tree
The clusters are organized into a tree, and the nodes of the tree may be very large, perhaps
disk blocks or pages, as would be the case for a B-tree or R-tree, which the cluster-representing tree resembles. Each leaf of the tree holds as many cluster representations as can fit.
Note that a cluster representation has a size that does not depend on the number of points
in the cluster.
An interior node of the cluster tree holds a sample of the clustroids of the clusters represented by each of its subtrees, along with pointers to the roots of those subtrees. The samples
are of fixed size, so the number of children that an interior node may have is independent
of its level. Notice that as we go up the tree, the probability that a given cluster's clustroid
is part of the sample diminishes.
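As a concrete illustration, the structures below sketch one way the cluster-representing tree might be laid out in code. The Python dataclasses and field names (clustroid, rowsum, and so on) are assumptions for illustration, not part of the algorithm's specification; the essential property is that each cluster representation and each interior-node sample has a fixed size.

from dataclasses import dataclass, field

@dataclass
class ClusterRep:
    """Fixed-size summary of one cluster; its size does not grow with N."""
    n: int            # N, the number of points in the cluster
    clustroid: tuple  # the point of the cluster chosen as its clustroid
    rowsum: float     # sum of squared distances from the clustroid to the cluster's points

@dataclass
class LeafNode:
    """A leaf holds as many fixed-size cluster representations as fit in one block."""
    reps: list = field(default_factory=list)      # list of ClusterRep

@dataclass
class InteriorNode:
    """An interior node keeps, for each child, a fixed-size sample of the clustroids
    of the clusters in that child's subtree, plus a pointer to the child itself."""
    samples: list = field(default_factory=list)   # samples[i] = sampled clustroids for children[i]
    children: list = field(default_factory=list)  # children[i] = LeafNode or InteriorNode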
We initialize the cluster tree by taking a main-memory sample of the dataset and clustering it hierarchically. The result of this clustering is a tree T, but T is not exactly the tree
used by the GRGPF Algorithm. Rather, we select from T certain of its nodes that represent
clusters of approximately some desired size n. These are the initial clusters for the GRGPF
Algorithm, and we place their representations at the leaves of the cluster-representing tree.
We then group clusters with a common ancestor in T into interior nodes of the cluster-representing tree, so in some sense, clusters descended from one interior node are as close as
possible. In some cases, rebalancing of the cluster-representing tree will be necessary. This
process is similar to the reorganization of a B-tree, and we shall not examine this issue in
detail.
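The sketch below suggests one way the initial clusters could be chosen from T and packed into leaves. The TreeNode class, the size test, and pack_into_leaves are illustrative assumptions; the text only requires clusters of roughly size n, grouped by common ancestor in T.

from dataclasses import dataclass, field

@dataclass
class TreeNode:
    """A node of the main-memory hierarchical clustering T."""
    points: list = field(default_factory=list)    # all sampled points under this node
    children: list = field(default_factory=list)  # subtrees of T

def select_initial_clusters(node, n):
    """Cut T at the highest nodes holding roughly n points or fewer;
    each selected node becomes one initial cluster."""
    if len(node.points) <= n or not node.children:
        return [node]
    clusters = []
    for child in node.children:
        clusters.extend(select_initial_clusters(child, n))
    return clusters

def pack_into_leaves(clusters, leaf_capacity):
    """Consecutive clusters in the depth-first order above share low ancestors
    in T, so packing them into the same leaf keeps close clusters together
    under one interior node of the cluster-representing tree."""
    return [clusters[i:i + leaf_capacity]
            for i in range(0, len(clusters), leaf_capacity)]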
7.5.3 Adding Points in the GRGPF Algorithm
We now read points from secondary storage and insert each one into the nearest cluster. We
start at the root, and look at the samples of clustroids for each of the children of the root.
Whichever child has the clustroid closest to the new point p is the node we examine next.
When we reach any node in the tree, we look at the sample clustroids for its children and
go next to the child with the clustroid closest to p. Note that some of the sample clustroids
at a node may have been seen at a higher level, but each level provides more detail about
the clusters lying below, so we see many new sample clustroids each time we go a level
down the tree.
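A minimal sketch of this descent follows, assuming the InteriorNode/LeafNode layout sketched in Section 7.5.2 and Euclidean distance; any distance measure would serve. It also previews the leaf-level choice of cluster described in the next paragraph.

import math

def nearest_leaf(root, p):
    """Descend from the root, always moving to the child one of whose sampled
    clustroids is closest to p, until a leaf is reached."""
    node = root
    while getattr(node, "children", None):        # stop once we reach a leaf
        best_child, best_dist = None, float("inf")
        for child, sample in zip(node.children, node.samples):
            for clustroid in sample:
                d = math.dist(p, clustroid)       # Euclidean distance, for illustration
                if d < best_dist:
                    best_dist, best_child = d, child
        node = best_child
    return node

def nearest_cluster(leaf, p):
    """At the leaf, pick the cluster whose clustroid is closest to p."""
    return min(leaf.reps, key=lambda rep: math.dist(p, rep.clustroid))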
Finally, we reach a leaf. This leaf has the cluster features for each cluster represented by
that leaf, and we pick the cluster whose clustroid is closest to p. We adjust the representation
of this cluster to account for the new point p. In particular, we:
(1) Add 1 to N.