statistics over a particular time horizon, by subtracting out the
statistics at the beginning of the horizon from the statistics at the
end of the horizon.
Computational Convenience: The first and second order statis-
tics can be used to compute a vast array of cluster parameters such
as the cluster centroid and radius. This makes it possible to com-
pute important cluster characteristics in real time.
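The two properties above can be sketched in a few lines of Python. This is an illustrative sketch only (the class and method names are hypothetical, not taken from [10]): it keeps the usual (n, LS, SS) summary, supports subtraction of snapshots to obtain horizon statistics, and derives the centroid and radius from the sums.

```python
import math

class MicroCluster:
    """Additive (n, LS, SS) summary of the points absorbed by a cluster."""

    def __init__(self, dim):
        self.n = 0                 # number of points absorbed
        self.ls = [0.0] * dim      # first-order (linear) sums, per dimension
        self.ss = [0.0] * dim      # second-order (squared) sums, per dimension

    def add(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x

    def subtract(self, other):
        """Horizon statistics: snapshot at the end minus snapshot at the start."""
        result = MicroCluster(len(self.ls))
        result.n = self.n - other.n
        result.ls = [a - b for a, b in zip(self.ls, other.ls)]
        result.ss = [a - b for a, b in zip(self.ss, other.ss)]
        return result

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        # RMS deviation from the centroid: average per-dimension variance,
        # where variance = SS/n - (LS/n)^2.
        var = sum(ss / self.n - (ls / self.n) ** 2
                  for ls, ss in zip(self.ls, self.ss)) / len(self.ls)
        return math.sqrt(max(var, 0.0))
```

Because every field is a simple sum, both maintenance (one addition per dimension per point) and horizon queries (one subtraction per dimension) run in constant time per cluster, which is what makes the statistics usable in real time.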
It has been shown in [10] that the micro-cluster technique is much more
effective and versatile than the k-means based stream technique dis-
cussed in [43]. This broad technique has also been extended to a variety
of other kinds of data. Some examples of such data are as follows:
High Dimensional Data: The stream clustering method can
also be extended to the concept of projected clustering [5]. A tech-
nique for high dimensional projected clustering of data streams is
discussed in [11]. In this case, the same micro-cluster statistics
are used for maintaining the characteristics of the clusters, except
that we also maintain additional information which keeps track of
the projected dimensions in each cluster. The projected dimen-
sions can be used in conjunction with the cluster statistics to com-
pute the projected distances which are required for intermediate
computations. Another innovation proposed in [11] is the use of
a decay-based approach for clustering. The idea in the decay-based
approach is relevant to the case of an evolving data stream model,
and is applicable not just to the high dimensional case, but to any
of the above variants of the micro-cluster model. In this approach,
the weight of a data point is defined as 2^(−λ·t), where t is the current
time-instant. Thus, each data point has a half-life of 1/λ, which is
the time in which the weight of the data point reduces by a factor
of 2. We note that the decay-based approach poses a challenge
because the micro-cluster statistics are affected at each clock tick,
even if no points arrive from the data stream. In order to deal with
this problem, a lazy approach is applied to decay-based updates, in
which we update the decay-behavior for a micro-cluster only if a
data point is added to it. The idea is that as long as we keep track
of the last time t_s at which the micro-cluster was updated, we only
need to multiply the micro-cluster statistics by 2^(−λ·(t_c − t_s)), where t_c
is the current time instant. After multiplying the statistics by
this factor, it is possible to add the micro-cluster statistics of the
current data point. This approach can be used since the statistics
of each micro-cluster decay by the same factor in each tick, and it
is therefore possible to implicitly keep track of the decayed values.
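The lazy decay update described above can be sketched as follows. This is a minimal illustration under the same (n, LS, SS) micro-cluster statistics; the class and field names are hypothetical, not taken from [11]:

```python
class DecayedMicroCluster:
    """Micro-cluster whose statistics are decayed lazily, only on insertion."""

    def __init__(self, dim, lam):
        self.lam = lam             # decay rate λ; the half-life is 1/λ
        self.n = 0.0               # decayed point count
        self.ls = [0.0] * dim      # decayed first-order sums
        self.ss = [0.0] * dim      # decayed second-order sums
        self.last_update = 0.0     # t_s: last time this cluster was updated

    def add(self, point, t_now):
        # Lazily apply all decay accumulated since the last update:
        # every statistic shrinks by the same factor 2^(-λ·(t_c - t_s)),
        # so no per-tick maintenance is needed while no points arrive.
        factor = 2.0 ** (-self.lam * (t_now - self.last_update))
        self.n *= factor
        self.ls = [v * factor for v in self.ls]
        self.ss = [v * factor for v in self.ss]
        # Then fold in the new point with full (undecayed) weight.
        self.n += 1.0
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x
        self.last_update = t_now
```

For example, with λ = 1, a cluster holding one point inserted at time 0 contributes weight 0.5 when the next point arrives at time 1, so the decayed count becomes 0.5 + 1 = 1.5 without the cluster ever having been touched in between.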