Databases Reference
In-Depth Information
independent. For instance, in two dimensions a cluster may be cigar-shaped,
but the cigar must not be rotated off of the axes. Figure 7.10 makes the point.
OK
OK
Not OK
Figure 7.10: The clusters in data for which the BFR algorithm may be used
can have standard deviations that differ along different axes, but the axes of
the cluster must align with the axes of the space
The BFR Algorithm begins by selecting k points, using one of the methods
discussed in Section 7.3.2. Then, the points of the data file are read in chunks.
These might be chunks from a distributed file system or a conventional file
might be partitioned into chunks of the appropriate size. Each chunk must
consist of few enough points that they can be processed in main memory. Also
stored in main memory are summaries of the k clusters and some other data,
so the entire memory is not available to store a chunk. The main-memory data
other than the chunk from the input consists of three types of objects:
1. The Discard Set : These are simple summaries of the clusters themselves.
We shall address the form of cluster summarization shortly. Note that
the cluster summaries are not “discarded”; they are in fact essential.
However, the points that the summary represents are discarded and have
no representation in main memory other than through this summary.
2. The Compressed Set : These are summaries, similar to the cluster sum-
maries, but for sets of points that have been found close to one another,
but not close to any cluster. The points represented by the compressed
set are also discarded, in the sense that they do not appear explicitly in
main memory. We call the represented sets of points miniclusters.
3. The Retained Set : Certain points can neither be assigned to a cluster nor
are they su ciently close to any other points that we can represent them
by a compressed set. These points are held in main memory exactly as
they appear in the input file.
The picture in Fig. 7.11 suggests how the points processed so far are represented.
The discard and compressed sets are represented by 2d + 1 values, if the
data is d-dimensional. These numbers are:
(a) The number of points represented, N .
Search WWH ::




Custom Search