the recency. Based on stored snapshots, the user can specify the time horizon
for which the clustering patterns can be obtained by running an offline
clustering component. This work can effectively get the clustering pattern
formed in the time horizon specified by the user, which is useful in applications
where the user knows exactly the time period he or she is interested in.
However, this work is still unable to automatically detect the concept drift
in a stream and discover the evolving patterns of the stream. There are few
works dedicated to mining changes from a data stream. Kifer et al. provide
a statistical definition of change and propose a change-detection algorithm
that compares the data in some “reference window” to the data in the current
window [13], while Aggarwal proposed a way to diagnose changes in evolving
data streams based on velocity density estimation [14]. However, works of this
type do not describe the pattern presented by the data stream in a stable
period or at a historical snapshot. The work closest to ours is the multidimensional
stream analysis [15]. This work applies a cube structure to organize
the streaming data. The granularity of the time dimension of the cube can
be second, minute, quarter, hour, and so on. Then each base cell of the cube
stores a compressed time series of data points arriving in the corresponding
time period. This work uses linear regression to compress the time series and
demonstrates how to aggregate the linear regression function along each di-
mension. Although the proposed architecture can facilitate OLAP queries over
stream data, it is not a suitable platform to automatically discover patterns
and pattern drifts across the section boundaries manually imposed on each
dimension. Unlike this work, which differentiates numerical facts from other
dimensions, the approach proposed in this paper views both numerical and
categorical attributes as dimensions. The segmentations on the numerical attributes
are pre-learned from sample data extracted from the stream. More
importantly, rather than manually separating the time dimension into even-length
units, our work automatically divides the time dimension into a series
of concept cycles based on detected concept drift, so that each concept cycle reflects
a relatively stable concept. A base cell of our cube structure maintains the
compressed time series of statistics of the data points falling into that cell, instead
of the time series of numerical attribute values. In the following sections
we will show that our cube structure facilitates not only the generation of
different types of patterns at any snapshot or within any concept cycle, but
also the detection of concept drift across multiple concept cycles, as well as of
micro-shifts within a concept cycle. Furthermore, one can easily impose manually
determined hierarchical levels onto the natural segmentations of each
dimension in order to facilitate OLAP queries.
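
To make the idea of a regression-compressed base cell more concrete, the following is a minimal sketch of how a cell could summarize its time series by a fitted line and how two cells covering adjacent time periods could be rolled up without revisiting the raw points. It assumes ordinary least squares on (time, value) pairs and keeps additive sufficient statistics. This is an illustrative choice and not necessarily the representation used in [15]. The class name RegressionCell and its methods are hypothetical, and the sketch covers aggregation along the time dimension only.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RegressionCell:
    """Hypothetical base cell: sufficient statistics for a line y = a + b*t."""
    n: int = 0
    sum_t: float = 0.0
    sum_y: float = 0.0
    sum_tt: float = 0.0
    sum_ty: float = 0.0

    def add(self, t: float, y: float) -> None:
        """Fold one (time, value) observation into the cell."""
        self.n += 1
        self.sum_t += t
        self.sum_y += y
        self.sum_tt += t * t
        self.sum_ty += t * y

    def merge(self, other: "RegressionCell") -> "RegressionCell":
        """Roll up two cells; every statistic is additive."""
        return RegressionCell(
            self.n + other.n,
            self.sum_t + other.sum_t,
            self.sum_y + other.sum_y,
            self.sum_tt + other.sum_tt,
            self.sum_ty + other.sum_ty,
        )

    def fit(self) -> Tuple[float, float]:
        """Return (intercept, slope) of the least-squares line."""
        denom = self.n * self.sum_tt - self.sum_t ** 2
        if self.n < 2 or denom == 0:
            raise ValueError("need at least two distinct time points")
        slope = (self.n * self.sum_ty - self.sum_t * self.sum_y) / denom
        intercept = (self.sum_y - slope * self.sum_t) / self.n
        return intercept, slope


# Two adjacent time units (an assumed granularity), each with four points
# drawn from the synthetic line y = 2t + 1.
first_unit = RegressionCell()
second_unit = RegressionCell()
for t in range(0, 4):
    first_unit.add(t, 2.0 * t + 1.0)
for t in range(4, 8):
    second_unit.add(t, 2.0 * t + 1.0)

# Aggregate along the time dimension, then fit the coarser cell.
hour = first_unit.merge(second_unit)
intercept, slope = hour.fit()
print(intercept, slope)
```

Because every stored quantity is a plain sum, the coarser cell yields the same line as a fit over all eight raw points, which is what allows the roll-up to operate on the compressed cells alone. On the synthetic data above the merged cell recovers an intercept of 1 and a slope of 2.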
3 Concept and Concept Drift
Let's assume we capture the n-dimensional data within only one concept cycle
marked as c1 from a data stream and plot the data in an n-dimensional space.