Database Reference
In-Depth Information
(2) Add the square of the distance between p and each of the nodes q mentioned in the rep-
resentation to ROWSUM( q ). These points q include the clustroid, the k nearest points,
and the k furthest points.
We also estimate the rowsum of p , in case p needs to be part of the representation (e.g.,
it turns out to be one of the k points closest to the clustroid). Note we cannot compute
ROWSUM( p ) exactly, without going to disk and retrieving all the points of the cluster. The
estimate we use is
ROWSUM( p ) = ROWSUM( c ) + Nd 2 ( p, c )
where d ( p, c ) is the distance between p and the clustroid c . Note that N and ROWSUM( c )
in this formula are the values of these features before they were adjusted to account for the
addition of p .
We might well wonder why this estimate works. In Section 7.1.3 we discussed the
“curse of dimensionality,” in particular the observation that in a high-dimensional Euc-
lidean space, almost all angles are right angles. Of course the assumption of the GRGPF
Algorithm is that the space might not be Euclidean, but typically a non-Euclidean space
also suffers from the curse of dimensionality, in that it behaves in many ways like a high-
dimensional Euclidean space. If we assume that the angle between p , c , and another point
q in the cluster is a right angle, then the Pythagorean theorem tell us that
d 2 ( p, q ) = d 2 ( p, c ) + d 2 ( c, q )
If we sum over all q other than c , and then add d 2 ( p, c ) to ROWSUM( p ) to account for
the fact that the clustroid is one of the points in the cluster, we derive ROWSUM( p ) =
ROWSUM( c ) + Nd 2 ( p, c ).
Now, we must see if the new point p is one of the k closest or furthest points from the
clustroid, and if so, p and its rowsum become a cluster feature, replacing one of the other
features - whichever is no longer one of the k closest or furthest. We also need to consider
whether the rowsum for one of the k closest points q is now less than ROWSUM( c ). That
situation could happen if p were closer to one of these points than to the current clustroid.
If so, we swap the roles of c and q . Eventually, it is possible that the true clustroid will no
longer be one of the original k closest points. We have no way of knowing, since we do not
see the other points of the cluster in main memory. However, they are all stored on disk,
and can be brought into main memory periodically for a recomputation of the cluster fea-
tures.
Search WWH ::




Custom Search