Clustering - Mining of Massive Datasets

Database Reference

In-Depth Information

✦ Processing Points in CURE : After creating representative points for each cluster, the entire set of points can be read

from disk and assigned to a cluster. We assign a given point to the cluster of the representative point that is closest

to the given point.

✦ The GRGPF Algorithm : This algorithm is of the point-assignment type. It handles data that is too big to fit in main

memory, and it does not assume a Euclidean space.

✦ Representing Clusters in GRGPF : A cluster is represented by the count of points in the cluster, the clustroid, a set

of points nearest the clustroid and a set of points furthest from the clustroid. The nearby points allow us to change

the clustroid if the cluster evolves, and the distant points allow for merging clusters efficiently in appropriate cir-

cumstances. For each of these points, we also record the rowsum, that is the square root of the sum of the squares of

the distances from that point to all the other points of the cluster.

✦ Tree Organization of Clusters in GRGPF : Cluster representations are organized into a tree structure like a B-tree,

where nodes of the tree are typically disk blocks and contain information about many clusters. The leaves hold the

representation of as many clusters as possible, while interior nodes hold a sample of the clustroids of the clusters

at their descendant leaves. We organize the tree so that the clusters whose representatives are in any subtree are as

close as possible.

✦ Processing Points in GRGPF : After initializing clusters from a sample of points, we insert each point into the cluster

with the nearest clustroid. Because of the tree structure, we can start at the root and choose to visit the child with the

sample clustroid nearest to the given point. Following this rule down one path in the tree leads us to a leaf, where

we insert the point into the cluster with the nearest clustroid on that leaf.

✦ Clustering Streams : A generalization of the DGIM Algorithm (for counting 1s in the sliding window of a stream)

can be used to cluster points that are part of a slowly evolving stream. The BDMO Algorithm uses buckets similar

to those of DGIM, with allowable bucket sizes forming a sequence where each size is twice the previous size.

✦ Representation of Buckets in BDMO : The size of a bucket is the number of points it represents. The bucket itself

holds only a representation of the clusters of these points, not the points themselves. A cluster representation in-

cludes a count of the number of points, the centroid or clustroid, and other information that is needed for merging

clusters according to some selected strategy.

✦ Merging Buckets in BDMO : When buckets must be merged, we find the best matching of clusters, one from each

of the buckets, and merge them in pairs. If the stream evolves slowly, then we expect consecutive buckets to have

almost the same cluster centroids, so this matching makes sense.

✦ Answering Queries in BDMO : A query is a length of a suffix of the sliding window. We take all the clusters in all

the buckets that are at least partially within that suffix and merge them using some strategy. The resulting clusters

are the answer to the query.

✦ Clustering Using MapReduce : We can divide the data into chunks and cluster each chunk in parallel, using a Map

task. The clusters from each Map task can be further clustered in a single Reduce task.

7.8 References for Chapter 7

The ancestral study of clustering for large-scale data is the BIRCH Algorithm of [ 6 ] . The

BFR Algorithm is from [ 2 ] . The CURE Algorithm is found in [ 5 ] .

The paper on the GRGPF Algorithm is [ 3 ] . The necessary background regarding B-trees

and R-trees can be found in [ 4 ] . The study of clustering on streams is taken from [ 1 ] .

[1] B. Babcock, M. Datar, R. Motwani, and L. O'Callaghan, “Maintaining variance and k-medians over data stream

windows,” Proc. ACM Symp. on Principles of Database Systems , pp. 234-243, 2003.

[2] P.S. Bradley, U.M. Fayyad, and C. Reina, “Scaling clustering algorithms to large databases,” Proc. Knowledge

Discovery and Data Mining , pp. 9-15, 1998.

Mining of Massive Datasets

Search WWH ::

Custom Search

Home