Database Reference
In-Depth Information
Figure 7.11 Points in the discard, compressed, and retained sets
The discard and compressed sets are represented by 2 d + 1 values, if the data is d -dimen-
sional. These numbers are:
(a) The number of points represented, N .
(b) The sum of the components of all the points in each dimension. This data is a vector
SUM of length d , and the component in the i th dimension is SUM i .
(c) The sum of the squares of the components of all the points in each dimension. This
data is a vector SUMSQ of length d , and its component in the i th dimension is
SUMSQ i .
Our real goal is to represent a set of points by their count, their centroid and the standard
deviation in each dimension. However, these 2 d + 1 values give us those statistics. N is
the count. The centroid's coordinate in the i th dimension is the SUM i /N , that is the sum
in that dimension divided by the number of points. The variance in the i th dimension is
SUMSQ i /N − (SUM i /N ) 2 . We can compute the standard deviation in each dimension, since
it is the square root of the variance.
Benefits of the N , SUM , SUMSQ Representation
There is a significant advantage to representing sets of points as it is done in the BFR Algorithm, rather than by storing
N , the centroid, and the standard deviation in each dimension. Consider what we need to do when we add a new point
to a cluster. N is increased by 1, of course. But we can also add the vector representing the location of the point to SUM
to get the new SUM, and we can add the squares of the components of the vector to SUMSQ to get the new SUMSQ.
Had we used the centroid in place of SUM, then we could not adjust the centroid to account for the new point without
doing some calculation involving N , and the recomputation of the standard deviations would be far more complex as
well. Similarly, if we want to combine two sets, we just add corresponding values of N , SUM, and SUMSQ, while if
we used the centroid and standard deviations as a representation, the calculation would be far more complex.
EXAMPLE 7.9 Suppose a cluster consists of the points (5 , 1), (6 , −2), and (7 , 0). Then N =
3, SUM = [18 , −1], and SUMSQ = [110 , 5]. The centroid is SUM /N , or [6 , −1 / 3]. The vari-
ance in the first dimension is 110 / 3 − (18 / 3) 2 = 0 . 667, so the standard deviation is
Search WWH ::




Custom Search