Clustering - Mining of Massive Datasets

Databases Reference

In-Depth Information

independent. For instance, in two dimensions a cluster may be cigar-shaped,

but the cigar must not be rotated off of the axes. Figure 7.10 makes the point.

OK

Not OK

Figure 7.10: The clusters in data for which the BFR algorithm may be used

can have standard deviations that differ along different axes, but the axes of

the cluster must align with the axes of the space

The BFR Algorithm begins by selecting k points, using one of the methods

discussed in Section 7.3.2. Then, the points of the data file are read in chunks.

These might be chunks from a distributed file system or a conventional file

might be partitioned into chunks of the appropriate size. Each chunk must

consist of few enough points that they can be processed in main memory. Also

stored in main memory are summaries of the k clusters and some other data,

so the entire memory is not available to store a chunk. The main-memory data

other than the chunk from the input consists of three types of objects:

1. The Discard Set : These are simple summaries of the clusters themselves.

We shall address the form of cluster summarization shortly. Note that

the cluster summaries are not “discarded”; they are in fact essential.

However, the points that the summary represents are discarded and have

no representation in main memory other than through this summary.

2. The Compressed Set : These are summaries, similar to the cluster sum-

maries, but for sets of points that have been found close to one another,

but not close to any cluster. The points represented by the compressed

set are also discarded, in the sense that they do not appear explicitly in

main memory. We call the represented sets of points miniclusters.

3. The Retained Set : Certain points can neither be assigned to a cluster nor

are they su ciently close to any other points that we can represent them

by a compressed set. These points are held in main memory exactly as

they appear in the input file.

The picture in Fig. 7.11 suggests how the points processed so far are represented.

The discard and compressed sets are represented by 2d + 1 values, if the

data is d-dimensional. These numbers are:

(a) The number of points represented, N .

Search WWH ::

Custom Search

Home