Database Reference
In-Depth Information
2 p , in which case we may have to merge buckets of increasing sizes, recursively, just as in
Section 4.6 .
To merge two consecutive buckets, we need to do several things:
(1) The size of the bucket is twice the sizes of the two buckets being merged.
(2) The timestamp for the merged bucket is the timestamp of the more recent of the two
consecutive buckets.
(3) We must consider whether to merge clusters, and if so, we need to compute the para-
meters of the merged clusters. We shall elaborate on this part of the algorithm by con-
sidering several examples of criteria for merging and ways to estimate the needed para-
meters.
EXAMPLE 7.14 Perhaps the simplest case is where we are using a k -means approach in a
Euclidean space. We represent clusters by the count of their points and their centroids. Each
bucket has exactly k clusters, so we can pick p = k , or we can pick p larger than k and cluster
the p points into k clusters when we create a bucket initially as in Section 7.6.3 . We must
find the best matching between the k clusters of the first bucket and the k clusters of the
second. Here, “best” means the matching that minimizes the sum of the distances between
the centroids of the matched clusters.
Note that we do not consider merging two clusters from the same bucket, because our
assumption is that clusters do not evolve too much between consecutive buckets. Thus, we
would expect to find in each of two adjacent buckets a representation of each of the k “true”
clusters that exist in the stream.
When we decide to merge two clusters, one from each bucket, the number of points in the
merged cluster is surely the sum of the numbers of points in the two clusters. The centroid
of the merged cluster is the weighted average of the centroids of the two clusters, where the
weighting is by the numbers of points in the clusters. That is, if the two clusters have n 1 and
n 2 points, respectively, and have centroids c 1 and c 2 (the latter are d -dimensional vectors
for some d ), then the combined cluster has n = n 1 + n 2 points and has centroid
EXAMPLE 7.15 The method of Example 7.14 suffices when the clusters are changing very
slowly. Suppose we might expect the cluster centroids to migrate sufficiently quickly that
when matching the centroids from two consecutive buckets, we might be faced with an am-
biguous situation, where it is not clear which of two clusters best matches a given cluster
from the other bucket. One way to protect against such a situation is to create more than k
clusters in each bucket, even if we know that, when we query (see Section 7.6.5 ) , we shall
have to merge into exactly k clusters. For example, we might choose p to be much larger
Search WWH ::




Custom Search