In the second dimension, the variance is 5/3 − (−1/3)² = 1.56, so the standard deviation is 1.25.
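This arithmetic follows directly from computing variance from a cluster summary. A minimal check in Python, assuming (consistently with the numbers above) that the cluster has N = 3 points with SUM = −1 and SUMSQ = 5 in this dimension:

```python
import math

# Assumed summary values for this dimension, chosen to match the
# arithmetic in the text: N = 3, SUM = -1, SUMSQ = 5.
N, SUM, SUMSQ = 3, -1.0, 5.0

# Variance in one dimension: SUMSQ/N - (SUM/N)^2
variance = SUMSQ / N - (SUM / N) ** 2
std_dev = math.sqrt(variance)

print(round(variance, 2))  # 1.56
print(round(std_dev, 2))   # 1.25
```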
7.3.5 Processing Data in the BFR Algorithm
We shall now outline what happens when we process a chunk of points.
(1) First, all points that are sufficiently close to the centroid of a cluster are added to that
cluster. As described in the box on benefits, it is simple to add the information about
the point to the N , SUM, and SUMSQ that represent the cluster. We then discard the
point. The question of what “sufficiently close” means will be addressed shortly.
(2) For the points that are not sufficiently close to any centroid, we cluster them, along
with the points in the retained set. Any main-memory clustering algorithm can be
used, such as the hierarchical methods discussed in Section 7.2. We must use some
criterion for deciding when it is reasonable to combine two points into a cluster or
two clusters into one. Section 7.2.3 covered the ways we might make this decision.
Clusters of more than one point are summarized and added to the compressed set.
Singleton clusters become the retained set of points.
(3) We now have miniclusters derived from our attempt to cluster new points and the old
retained set, and we have the miniclusters from the old compressed set. Although none
of these miniclusters can be merged with one of the k clusters, they might merge with
one another. The criterion for merger may again be chosen according to the discussion
in Section 7.2.3. Note that the form of representation for compressed sets (N, SUM,
and SUMSQ) makes it easy to compute statistics such as the variance for the
combination of two miniclusters that we consider merging.
(4) Points that are assigned to a cluster or a minicluster, i.e., those that are not in the
retained set, are written out, with their assignment, to secondary memory.
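Steps (1) and (3) both rest on the fact that the (N, SUM, SUMSQ) summaries are additive. The representation below follows the text, but the function names and data layout are my own sketch, not a prescribed implementation:

```python
def add_point(summary, p):
    """Fold a d-dimensional point p into a cluster summary [N, SUM, SUMSQ] (step 1)."""
    n, s, sq = summary
    summary[0] = n + 1
    summary[1] = [si + pi for si, pi in zip(s, p)]
    summary[2] = [qi + pi * pi for qi, pi in zip(sq, p)]

def merge(a, b):
    """Combine two minicluster summaries (step 3): component-wise addition."""
    return [a[0] + b[0],
            [x + y for x, y in zip(a[1], b[1])],
            [x + y for x, y in zip(a[2], b[2])]]

def variance(summary):
    """Per-dimension variance computed from the summary alone: SUMSQ_i/N - (SUM_i/N)^2."""
    n, s, sq = summary
    return [qi / n - (si / n) ** 2 for si, qi in zip(s, sq)]

# Example: merge two 1-point miniclusters at (0, 0) and (2, 4).
a = [1, [0.0, 0.0], [0.0, 0.0]]
b = [1, [2.0, 4.0], [4.0, 16.0]]
m = merge(a, b)
print(m)             # [2, [2.0, 4.0], [4.0, 16.0]]
print(variance(m))   # [1.0, 4.0]
```

Because merging is just addition of triples, the variance of a proposed combination can be computed before deciding whether to merge, without revisiting any points.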
Finally, if this is the last chunk of input data, we need to do something with the
compressed and retained sets. We can treat them as outliers, and never cluster them at all.
Or we can assign each point in the retained set to the cluster of the nearest centroid, and
combine each minicluster with the cluster whose centroid is closest to the centroid of the
minicluster.
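The second option is a nearest-centroid assignment, applied to retained points directly and to miniclusters via their centroids (SUM/N). A sketch under assumed data layouts (the helper and the example values are illustrative, not part of the algorithm's specification):

```python
def nearest(centroids, p):
    """Index of the centroid nearest to point p (squared Euclidean distance)."""
    def dist2(c):
        return sum((ci - pi) ** 2 for ci, pi in zip(c, p))
    return min(range(len(centroids)), key=lambda i: dist2(centroids[i]))

# Assumed final state: centroids of the k clusters, a retained set of
# points, and minicluster summaries of the form [N, SUM, SUMSQ].
centroids = [[0.0, 0.0], [10.0, 10.0]]
retained = [[1.0, -1.0], [9.0, 11.0]]
miniclusters = [[2, [19.0, 21.0], [181.0, 221.0]]]

# Assign each retained point to the cluster of the nearest centroid.
point_assignment = [nearest(centroids, p) for p in retained]
print(point_assignment)  # [0, 1]

# Combine each minicluster with the cluster whose centroid is closest
# to the minicluster's centroid, SUM / N.
mini_assignment = [nearest(centroids, [s / m[0] for s in m[1]])
                   for m in miniclusters]
print(mini_assignment)   # [1]
```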
An important decision that must be examined is how we decide whether a new point p
is close enough to one of the k clusters that it makes sense to add p to the cluster. Two
approaches have been suggested.