Capturing Concepts and Detecting Concept-Drift from Potential Unbounded, Ever-Evolving and High-Dimensional Data Streams - Data Mining: Foundations and Practice


the recency. Based on stored snapshots, the user can specify the time horizon

for which the clustering patterns can be obtained by running an offline

clustering component. This work can effectively get the clustering pattern

formed in the time horizon specified by the user, which is useful in applications

where the user knows exactly the time period he or she is interested in.

However, this work is still unable to automatically detect the concept drift

in a stream and discover the evolving patterns of the stream. There are few

works dedicated to mining changes from a data stream. Kifer et al. provide

a statistical definition of change and propose a change-detection algorithm

that compares the data in some “reference window” to the data in the current

window [13]. Aggarwal proposed a way to diagnose changes in evolving

data streams based on velocity density estimation [14]. However, these

works do not describe the pattern presented by the data stream in a stable

period or at a historical snapshot. The work closest to ours is the multidimen-

sional stream analysis [15]. This work applies a cube structure to organize

the streaming data. The granularity of the time dimension of the cube can

be second, minute, quarter, hour, and so on. Then each base cell of the cube

stores a compressed time series of data points arriving in the corresponding

time period. This work uses linear regression to compress the time series and

demonstrates how to aggregate the linear regression function along each di-

mension. Although the proposed architecture can facilitate OLAP queries over

stream data, it is not a suitable platform to automatically discover patterns

and pattern drifts across the section boundaries manually imposed on each

dimension. Unlike this work, which differentiates numerical facts from other

dimensions, the approach proposed in this paper views both numerical and

categorical attributes as dimensions. The segmentations on the numerical at-

tributes are pre-learned from sample data extracted from the stream. More

importantly, rather than manually separating the time dimension into even-length

units, our work automatically divides the time dimension into a series

of concept cycles according to detected concept drift, so that each concept cycle reflects

a relatively stable concept. A base cell of our cube structure maintains the

compressed time series of statistics of the data points falling into that cell, in-

stead of the time series of numerical attribute values. In the following sections

we will show that our cube structure can facilitate not only the generation of

different types of patterns at any snapshot or within any concept cycle, but

also the detection of concept drift across multiple concept cycles, as well as the

micro-shift within a concept cycle. Furthermore, one can easily impose man-

ually determined hierarchical levels onto the natural segmentations of each

dimension in order to facilitate OLAP queries.
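
To make the idea of cutting the time dimension at detected drifts more concrete, the following is a minimal sketch and not the algorithm of this chapter or of [13]. It assumes a one-dimensional numeric stream, compares a reference window with each later window, and starts a new concept cycle whenever the two differ too much. The two-sample Kolmogorov-Smirnov statistic is swapped in here as the distance between windows purely for illustration; the function names `ks_statistic` and `split_into_cycles` and the parameters `window` and `threshold` are hypothetical.

```python
from typing import List, Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest gap between the empirical CDFs of two samples."""
    xs, ys = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        # Step past every copy of v in both samples so ties are handled.
        while i < len(xs) and xs[i] == v:
            i += 1
        while j < len(ys) and ys[j] == v:
            j += 1
        gap = max(gap, abs(i / len(xs) - j / len(ys)))
    return gap


def split_into_cycles(stream: Sequence[float],
                      window: int = 50,
                      threshold: float = 0.5) -> List[int]:
    """Return the start index of each concept cycle.

    The first `window` points of a cycle act as its reference window.
    Each later, non-overlapping window is compared with that reference;
    if the distance exceeds `threshold` (an assumed, hand-picked value),
    a drift is declared and a new cycle starts at that window.
    """
    starts = [0]
    ref_start = 0
    pos = window
    while pos + window <= len(stream):
        reference = stream[ref_start:ref_start + window]
        current = stream[pos:pos + window]
        if ks_statistic(reference, current) > threshold:
            starts.append(pos)
            ref_start = pos
        pos += window
    return starts


# Synthetic stream: values in [0, 0.9] for 200 points, then in [5, 5.9].
stream = [(i % 10) / 10 for i in range(200)]
stream += [5 + (i % 10) / 10 for i in range(200)]
print(split_into_cycles(stream))  # → [0, 200]
```

In this sketch each cycle boundary falls on a window boundary, so the drift position is only located to within `window` points. Smaller windows locate the drift more precisely but make the comparison noisier.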

3 Concept and Concept Drift

Let's assume we capture the n-dimensional data within only one concept cycle

marked as c1 from a data stream and plot the data in an n-dimensional space.
