Clustering - Mining of Massive Datasets

Databases Reference

In-Depth Information

Benefits of the N, SUM , SUMSQ Representation

There is a significant advantage to representing sets of points as it is done

in the BFR Algorithm, rather than by storing N , the centroid, and the

standard deviation in each dimension. Consider what we need to do when

we add a new point to a cluster. N is increased by 1, of course. But we

can also add the vector representing the location of the point to SUM to

get the new SUM , and we can add the squares of the ith components of

the vector to SUMSQ i to get the new SUMSQ i . Had we used the centroid in

place of SUM , then we could not adjust the centroid to account for the new

point without doing some calculation involving N , and the recomputation

of the standard deviations would be far more complex as well. Similarly,

if we want to combine two sets, we just add corresponding values of N ,

SUM , and SUMSQ , while if we used the centroid and standard deviations

as a representation, the calculation would be far more complex.

1. First, all points that are su ciently close to the centroid of a cluster are

added to that cluster. As described in the box on benefits, it is simple

to add the information about the point to the N , SUM , and SUMSQ that

represent the cluster. We then discard the point. The question of what

“su ciently close” means will be addressed shortly.

2. For the points that are not su ciently close to any centroid, we clus-

ter them, along with the points in the retained set. Any main-memory

clustering algorithm can be used, such as the hierarchical methods dis-

cussed in Section 7.2. We must use some criterion for deciding when it is

reasonable to combine two points into a cluster or two clusters into one.

Section 7.2.3 covered the ways we might make this decision. Clusters of

more than one point are summarized and added to the compressed set.

Singleton clusters become the retained set of points.

3. We now have miniclusters derived from our attempt to cluster new points

and the old retained set, and we have the miniclusters from the old com-

pressed set. Although none of these miniclusters can be merged with one

of the k clusters, they might merge with one another. The criterion for

merger may again be chosen according to the discussion in Section 7.2.3.

Note that the form of representation for compressed sets (N , SUM , and

SUMSQ ) makes it easy to compute statistics such as the variance for the

combination of two miniclusters that we consider merging.

4. Points that are assigned to a cluster or a minicluster, i.e., those that

are not in the retained set, are written out, with their assignment, to

secondary memory.

Search WWH ::

Custom Search

Home