Clustering - Mining of Massive Datasets

Databases Reference

In-Depth Information

the other points; that is, the clustroid is the point in the cluster with the

smallest ROWSUM .

3. The rowsum of the clustroid of the cluster.

4. For some chosen constant k, the k points of the cluster that are closest to

the clustroid, and their rowsums. These points are part of the represen-

tation in case the addition of points to the cluster causes the clustroid to

change. The assumption is made that the new clustroid would be one of

these k points near the old clustroid.

5. The k points of the cluster that are furthest from the clustroid and their

rowsums. These points are part of the representation so that we can

consider whether two clusters are close enough to merge. The assumption

is made that if two clusters are close, then a pair of points distant from

their respective clustroids would be close.

These features form the representation of a cluster.

7.5.2

Initializing the Cluster Tree

The clusters are organized into a tree, and the nodes of the tree may be very

large, perhaps disk blocks or pages, as would be the case for a B-tree or R-tree,

which the cluster-representing tree resembles. Each leaf of the tree holds as

many cluster representations as can fit. Note that a cluster representation has

a size that does not depend on the number of points in the cluster.

An interior node of the cluster tree holds a sample of the clustroids of the

clusters represented by each of its subtrees, along with pointers to the roots of

those subtrees. The samples are of fixed size, so the number of children that

an interior node may have is independent of its level. Notice that as we go up

the tree, the probability that a given cluster's clustroid is part of the sample

diminishes.

We initialize the cluster tree by taking a main-memory sample of the dataset

and clustering it hierarchically. The result of this clustering is a tree T , but

T is not exactly the tree used by the GRGPF Algorithm. Rather, we select

from T certain of its nodes that represent clusters of approximately some de-

sired size n. These are the initial clusters for the GRGPF Algorithm, and we

place their representations at the leaf of the cluster-representing tree. We then

group clusters with a common ancestor in T into interior nodes of the cluster-

representing tree, so in some sense, clusters descended from one interior node

are close as possible. In some cases, rebalancing of the cluster-representing tree

will be necessary. This process is similar to the reorganization of a B-tree, and

we shall not examine this issue in detail.

Search WWH ::

Custom Search

Home