Databases Reference
In-Depth Information
Benefits of the N, SUM , SUMSQ Representation
There is a significant advantage to representing sets of points as it is done
in the BFR Algorithm, rather than by storing N , the centroid, and the
standard deviation in each dimension. Consider what we need to do when
we add a new point to a cluster. N is increased by 1, of course. But we
can also add the vector representing the location of the point to SUM to
get the new SUM , and we can add the squares of the ith components of
the vector to SUMSQ i to get the new SUMSQ i . Had we used the centroid in
place of SUM , then we could not adjust the centroid to account for the new
point without doing some calculation involving N , and the recomputation
of the standard deviations would be far more complex as well. Similarly,
if we want to combine two sets, we just add corresponding values of N ,
SUM , and SUMSQ , while if we used the centroid and standard deviations
as a representation, the calculation would be far more complex.
1. First, all points that are su ciently close to the centroid of a cluster are
added to that cluster. As described in the box on benefits, it is simple
to add the information about the point to the N , SUM , and SUMSQ that
represent the cluster. We then discard the point. The question of what
“su ciently close” means will be addressed shortly.
2. For the points that are not su ciently close to any centroid, we clus-
ter them, along with the points in the retained set. Any main-memory
clustering algorithm can be used, such as the hierarchical methods dis-
cussed in Section 7.2. We must use some criterion for deciding when it is
reasonable to combine two points into a cluster or two clusters into one.
Section 7.2.3 covered the ways we might make this decision. Clusters of
more than one point are summarized and added to the compressed set.
Singleton clusters become the retained set of points.
3. We now have miniclusters derived from our attempt to cluster new points
and the old retained set, and we have the miniclusters from the old com-
pressed set. Although none of these miniclusters can be merged with one
of the k clusters, they might merge with one another. The criterion for
merger may again be chosen according to the discussion in Section 7.2.3.
Note that the form of representation for compressed sets (N , SUM , and
SUMSQ ) makes it easy to compute statistics such as the variance for the
combination of two miniclusters that we consider merging.
4. Points that are assigned to a cluster or a minicluster, i.e., those that
are not in the retained set, are written out, with their assignment, to
secondary memory.
Search WWH ::




Custom Search