In the second dimension, the variance is 5/3 − (−1/3)² = 1.56, so the standard deviation is 1.25.
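This arithmetic follows directly from computing variance from a cluster summary. A minimal check in Python, assuming (consistently with the numbers above) that the cluster has N = 3 points with SUM = −1 and SUMSQ = 5 in this dimension:

```python
import math

# Assumed summary values for this dimension, chosen to match the
# arithmetic in the text: N = 3, SUM = -1, SUMSQ = 5.
N, SUM, SUMSQ = 3, -1.0, 5.0

# Variance in one dimension: SUMSQ/N - (SUM/N)^2
variance = SUMSQ / N - (SUM / N) ** 2
std_dev = math.sqrt(variance)

print(round(variance, 2))  # 1.56
print(round(std_dev, 2))   # 1.25
```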
7.3.5 Processing Data in the BFR Algorithm
We shall now outline what happens when we process a chunk of points.
(1) First, all points that are sufficiently close to the centroid of a cluster are added to that
cluster. As described in the box on benefits, it is simple to add the information about
the point to the N , SUM, and SUMSQ that represent the cluster. We then discard the
point. The question of what “sufficiently close” means will be addressed shortly.
(2) For the points that are not sufficiently close to any centroid, we cluster them, along
with the points in the retained set. Any main-memory clustering algorithm can be
used, such as the hierarchical methods discussed in Section 7.2. We must use some
criterion for deciding when it is reasonable to combine two points into a cluster or
two clusters into one. Section 7.2.3 covered the ways we might make this decision.
Clusters of more than one point are summarized and added to the compressed set.
Singleton clusters become the retained set of points.
(3) We now have miniclusters derived from our attempt to cluster new points and the old
retained set, and we have the miniclusters from the old compressed set. Although none
of these miniclusters can be merged with one of the k clusters, they might merge with
one another. The criterion for merger may again be chosen according to the discussion
in Section 7.2.3. Note that the form of representation for compressed sets (N, SUM,
and SUMSQ) makes it easy to compute statistics such as the variance for the
combination of two miniclusters that we consider merging.
(4) Points that are assigned to a cluster or a minicluster, i.e., those that are not in the
retained set, are written out, with their assignment, to secondary memory.
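Steps (1) and (3) both rest on the fact that the (N, SUM, SUMSQ) summaries are additive. The representation below follows the text, but the function names and data layout are my own sketch, not a prescribed implementation:

```python
def add_point(summary, p):
    """Fold a d-dimensional point p into a cluster summary [N, SUM, SUMSQ] (step 1)."""
    n, s, sq = summary
    summary[0] = n + 1
    summary[1] = [si + pi for si, pi in zip(s, p)]
    summary[2] = [qi + pi * pi for qi, pi in zip(sq, p)]

def merge(a, b):
    """Combine two minicluster summaries (step 3): component-wise addition."""
    return [a[0] + b[0],
            [x + y for x, y in zip(a[1], b[1])],
            [x + y for x, y in zip(a[2], b[2])]]

def variance(summary):
    """Per-dimension variance computed from the summary alone: SUMSQ_i/N - (SUM_i/N)^2."""
    n, s, sq = summary
    return [qi / n - (si / n) ** 2 for si, qi in zip(s, sq)]

# Example: merge two 1-point miniclusters at (0, 0) and (2, 4).
a = [1, [0.0, 0.0], [0.0, 0.0]]
b = [1, [2.0, 4.0], [4.0, 16.0]]
m = merge(a, b)
print(m)             # [2, [2.0, 4.0], [4.0, 16.0]]
print(variance(m))   # [1.0, 4.0]
```

Because merging is just addition of triples, the variance of a proposed combination can be computed before deciding whether to merge, without revisiting any points.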
Finally, if this is the last chunk of input data, we need to do something with the
compressed and retained sets. We can treat them as outliers, and never cluster them at all.
Or we can assign each point in the retained set to the cluster of the nearest centroid, and
combine each minicluster with the cluster whose centroid is closest to the centroid of the
minicluster.
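The second option is a nearest-centroid assignment, applied to retained points directly and to miniclusters via their centroids (SUM/N). A sketch under assumed data layouts (the helper and the example values are illustrative, not part of the algorithm's specification):

```python
def nearest(centroids, p):
    """Index of the centroid nearest to point p (squared Euclidean distance)."""
    def dist2(c):
        return sum((ci - pi) ** 2 for ci, pi in zip(c, p))
    return min(range(len(centroids)), key=lambda i: dist2(centroids[i]))

# Assumed final state: centroids of the k clusters, a retained set of
# points, and minicluster summaries of the form [N, SUM, SUMSQ].
centroids = [[0.0, 0.0], [10.0, 10.0]]
retained = [[1.0, -1.0], [9.0, 11.0]]
miniclusters = [[2, [19.0, 21.0], [181.0, 221.0]]]

# Assign each retained point to the cluster of the nearest centroid.
point_assignment = [nearest(centroids, p) for p in retained]
print(point_assignment)  # [0, 1]

# Combine each minicluster with the cluster whose centroid is closest
# to the minicluster's centroid, SUM / N.
mini_assignment = [nearest(centroids, [s / m[0] for s in m[1]])
                   for m in miniclusters]
print(mini_assignment)   # [1]
```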
An important decision that must be examined is how we decide whether a new point p
is close enough to one of the k clusters that it makes sense to add p to the cluster. Two
approaches have been suggested.