Database Reference
In-Depth Information
Processing Points in CURE : After creating representative points for each cluster, the entire set of points can be read
from disk and assigned to a cluster. We assign a given point to the cluster of the representative point that is closest
to the given point.
The GRGPF Algorithm : This algorithm is of the point-assignment type. It handles data that is too big to fit in main
memory, and it does not assume a Euclidean space.
Representing Clusters in GRGPF : A cluster is represented by the count of points in the cluster, the clustroid, a set
of points nearest the clustroid and a set of points furthest from the clustroid. The nearby points allow us to change
the clustroid if the cluster evolves, and the distant points allow for merging clusters efficiently in appropriate cir-
cumstances. For each of these points, we also record the rowsum, that is the square root of the sum of the squares of
the distances from that point to all the other points of the cluster.
Tree Organization of Clusters in GRGPF : Cluster representations are organized into a tree structure like a B-tree,
where nodes of the tree are typically disk blocks and contain information about many clusters. The leaves hold the
representation of as many clusters as possible, while interior nodes hold a sample of the clustroids of the clusters
at their descendant leaves. We organize the tree so that the clusters whose representatives are in any subtree are as
close as possible.
Processing Points in GRGPF : After initializing clusters from a sample of points, we insert each point into the cluster
with the nearest clustroid. Because of the tree structure, we can start at the root and choose to visit the child with the
sample clustroid nearest to the given point. Following this rule down one path in the tree leads us to a leaf, where
we insert the point into the cluster with the nearest clustroid on that leaf.
Clustering Streams : A generalization of the DGIM Algorithm (for counting 1s in the sliding window of a stream)
can be used to cluster points that are part of a slowly evolving stream. The BDMO Algorithm uses buckets similar
to those of DGIM, with allowable bucket sizes forming a sequence where each size is twice the previous size.
Representation of Buckets in BDMO : The size of a bucket is the number of points it represents. The bucket itself
holds only a representation of the clusters of these points, not the points themselves. A cluster representation in-
cludes a count of the number of points, the centroid or clustroid, and other information that is needed for merging
clusters according to some selected strategy.
Merging Buckets in BDMO : When buckets must be merged, we find the best matching of clusters, one from each
of the buckets, and merge them in pairs. If the stream evolves slowly, then we expect consecutive buckets to have
almost the same cluster centroids, so this matching makes sense.
Answering Queries in BDMO : A query is a length of a suffix of the sliding window. We take all the clusters in all
the buckets that are at least partially within that suffix and merge them using some strategy. The resulting clusters
are the answer to the query.
Clustering Using MapReduce : We can divide the data into chunks and cluster each chunk in parallel, using a Map
task. The clusters from each Map task can be further clustered in a single Reduce task.
7.8 References for Chapter 7
The ancestral study of clustering for large-scale data is the BIRCH Algorithm of [ 6 ] . The
BFR Algorithm is from [ 2 ] . The CURE Algorithm is found in [ 5 ] .
The paper on the GRGPF Algorithm is [ 3 ] . The necessary background regarding B-trees
and R-trees can be found in [ 4 ] . The study of clustering on streams is taken from [ 1 ] .
[1] B. Babcock, M. Datar, R. Motwani, and L. O'Callaghan, “Maintaining variance and k-medians over data stream
windows,” Proc. ACM Symp. on Principles of Database Systems , pp. 234-243, 2003.
[2] P.S. Bradley, U.M. Fayyad, and C. Reina, “Scaling clustering algorithms to large databases,” Proc. Knowledge
Discovery and Data Mining , pp. 9-15, 1998.
Search WWH ::




Custom Search