Database Reference
In-Depth Information
(a) Add p to a cluster if it not only has the centroid closest to p , but it is very unlikely that,
after all the points have been processed, some other cluster centroid will be found to
be nearer to p . This decision is a complex statistical calculation. It must assume that
points are ordered randomly, and that we know how many points will be processed in
the future. Its advantage is that if we find one centroid to be significantly closer to p
than any other, we can add p to that cluster and be done with it, even if p is very far
from all centroids.
(b) We can measure the probability that, if p belongs to a cluster, it would be found as far
as it is from the centroid of that cluster. This calculation makes use of the fact that we
believe each cluster to consist of normally distributed points with the axes of the distri-
bution aligned with the axes of the space. It allows us to make the calculation through
the Mahalanobis distance of the point, which we shall describe next.
The Mahalanobis distance is essentially the distance between a point and the centroid of
a cluster, normalized by the standard deviation of the cluster in each dimension. Since the
BFR Algorithm assumes the axes of the cluster align with the axes of the space, the com-
putation of Mahalanobis distance is especially simple. Let p = [ p 1 , p 2 , . . . , p d ] be a point
and c = [ c 1 , c 2 , . . . , c d ] the centroid of a cluster. Let σ i be the standard deviation of points
in the cluster in the i th dimension. Then the Mahalanobis distance between p and c is
That is, we normalize the difference between p and c in the i th dimension by dividing by
the standard deviation of the cluster in that dimension. The rest of the formula combines
the normalized distances in each dimension in the normal way for a Euclidean space.
To assign point p to a cluster, we compute the Mahalanobis distance between p and each
of the cluster centroids. We choose that cluster whose centroid has the least Mahalanobis
distance, and we add p to that cluster provided the Mahalanobis distance is less than a
threshold. For instance, suppose we pick four as the threshold. If data is normally distrib-
uted, then the probability of a value as far as four standard deviations from the mean is less
than one in a million. Thus, if the points in the cluster are really normally distributed, then
the probability that we will fail to include a point that truly belongs is less than 10 6 . And
such a point is likely to be assigned to that cluster eventually anyway, as long as it does not
wind up closer to some other centroid as centroids migrate in response to points added to
their cluster.
Search WWH ::




Custom Search