Database Reference
In-Depth Information
7.3.4
The Algorithm of Bradley, Fayyad, and Reina
This algorithm, which we shall refer to as BFR after its authors, is a variant of k -means that
is designed to cluster data in a high-dimensional Euclidean space. It makes a very strong
assumption about the shape of clusters: they must be normally distributed about a centroid.
The mean and standard deviation for a cluster may differ for different dimensions, but the
dimensions must be independent. For instance, in two dimensions a cluster may be cigar-
shaped, but the cigar must not be rotated off of the axes. Figure 7.10 makes the point.
Figure 7.10 The clusters in data for which the BFR algorithm may be used can have standard deviations that differ
along different axes, but the axes of the cluster must align with the axes of the space
The BFR Algorithm begins by selecting k points, using one of the methods discussed in
Section 7.3.2 . Then, the points of the data file are read in chunks. These might be chunks
from a distributed file system or a conventional file might be partitioned into chunks of the
appropriate size. Each chunk must consist of few enough points that they can be processed
in main memory. Also stored in main memory are summaries of the k clusters and some
other data, so the entire memory is not available to store a chunk. The main-memory data
other than the chunk from the input consists of three types of objects:
(1) The Discard Set : These are simple summaries of the clusters themselves. We shall ad-
dress the form of cluster summarization shortly. Note that the cluster summaries are
not “discarded”; they are in fact essential. However, the points that the summary rep-
resents are discarded and have no representation in main memory other than through
this summary.
(2) The Compressed Set : These are summaries, similar to the cluster summaries, but for
sets of points that have been found close to one another, but not close to any cluster.
The points represented by the compressed set are also discarded, in the sense that they
do not appear explicitly in main memory. We call the represented sets of points min-
iclusters .
(3) The Retained Set : Certain points can neither be assigned to a cluster nor are they suffi-
ciently close to any other points that we can represent them by a compressed set. These
points are held in main memory exactly as they appear in the input file.
The picture in Fig. 7.11 suggests how the points processed so far are represented.
Search WWH ::




Custom Search