Clustering - Mining of Massive Datasets

Database Reference

In-Depth Information

7.3.4

The Algorithm of Bradley, Fayyad, and Reina

This algorithm, which we shall refer to as BFR after its authors, is a variant of k -means that

is designed to cluster data in a high-dimensional Euclidean space. It makes a very strong

assumption about the shape of clusters: they must be normally distributed about a centroid.

The mean and standard deviation for a cluster may differ for different dimensions, but the

dimensions must be independent. For instance, in two dimensions a cluster may be cigar-

shaped, but the cigar must not be rotated off of the axes. Figure 7.10 makes the point.

Figure 7.10 The clusters in data for which the BFR algorithm may be used can have standard deviations that differ

along different axes, but the axes of the cluster must align with the axes of the space

The BFR Algorithm begins by selecting k points, using one of the methods discussed in

Section 7.3.2 . Then, the points of the data file are read in chunks. These might be chunks

from a distributed file system or a conventional file might be partitioned into chunks of the

appropriate size. Each chunk must consist of few enough points that they can be processed

in main memory. Also stored in main memory are summaries of the k clusters and some

other data, so the entire memory is not available to store a chunk. The main-memory data

other than the chunk from the input consists of three types of objects:

(1) The Discard Set : These are simple summaries of the clusters themselves. We shall ad-

dress the form of cluster summarization shortly. Note that the cluster summaries are

not “discarded”; they are in fact essential. However, the points that the summary rep-

resents are discarded and have no representation in main memory other than through

this summary.

(2) The Compressed Set : These are summaries, similar to the cluster summaries, but for

sets of points that have been found close to one another, but not close to any cluster.

The points represented by the compressed set are also discarded, in the sense that they

do not appear explicitly in main memory. We call the represented sets of points min-

iclusters .

(3) The Retained Set : Certain points can neither be assigned to a cluster nor are they suffi-

ciently close to any other points that we can represent them by a compressed set. These

points are held in main memory exactly as they appear in the input file.

The picture in Fig. 7.11 suggests how the points processed so far are represented.

Search WWH ::

Custom Search

Home