Capturing Concepts and Detecting Concept-Drift from Potential Unbounded, Ever-Evolving and High-Dimensional Data Streams - Data Mining: Foundations and Practice


Fig. 1. Descriptions of concept C1 and C2

All the dense areas in the n-dimensional space can be viewed as an approximate
description of c1. When the data arrives as a stream, it takes time for the
description of the concept to become identifiable: during this period, some
dense areas appear earlier and others later. Suppose we also capture the data
within the concept cycle immediately following c1 and mark it as c2. Again, we
can approximately describe c2 by its dense areas (see Fig. 1). The question
that needs to be addressed is how to identify the boundary between c1 and c2.
In other words, while the data is streaming, how can we know whether it is
still in the forming period of cycle c1 or has already entered cycle c2?

To address this problem, we first view each dense area in a concept cycle as a
group of adjacent dense cells. The size of a dense cell is learned from static
training data extracted from the corresponding data stream, as discussed in
Sect. 4. As the data streams in, data points keep falling into their
corresponding cells, so that some cells come to hold more points than a
threshold θ_n (also learned from the static training data) and become dense.
We timestamp each moment at which a cell becomes dense. In this way, we
maintain a time series of timestamps that mark the occurrences of new dense
cells. Whenever a new timestamp is added to this time series, a linear
regression is conducted to predict the next timestamp t_pred. When the actual
next timestamp t_next is recorded, we calculate the difference between t_next
and t_pred. If (t_next − t_pred) > θ_t, we view the new dense cell formed at
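
The procedure described in this passage can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (cell_of, fit_line, detect_boundaries) are hypothetical, cells are taken to be axis-aligned hypercubes of a fixed side length, and the regression is assumed to fit each dense-cell timestamp against its position in the series (0, 1, 2, ...). The text here stops before stating what follows when the gap exceeds θ_t; the sketch simply records that timestamp as a candidate boundary between concept cycles.

```python
from collections import defaultdict


def cell_of(point, cell_size):
    """Map a point to the index tuple of the grid cell containing it."""
    return tuple(int(x // cell_size) for x in point)


def fit_line(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...

    Returns (slope, intercept). Requires at least two values.
    """
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def detect_boundaries(stream, cell_size, theta_n, theta_t):
    """Return timestamps at which a new dense cell arrived later than predicted.

    stream yields (timestamp, point) pairs in time order.
    theta_n: a cell becomes dense once it holds more than theta_n points.
    theta_t: tolerated lateness of the next dense cell (t_next - t_pred).
    """
    counts = defaultdict(int)   # points seen so far in each cell
    dense = set()               # cells that have already become dense
    dense_times = []            # time series of dense-cell timestamps
    boundaries = []             # assumed interpretation: candidate boundaries

    for t_next, point in stream:
        cell = cell_of(point, cell_size)
        counts[cell] += 1
        if cell in dense or counts[cell] <= theta_n:
            continue
        dense.add(cell)
        # Predict when this dense cell "should" have appeared, using only
        # the timestamps recorded before it (two are needed to fit a line).
        if len(dense_times) >= 2:
            slope, intercept = fit_line(dense_times)
            t_pred = slope * len(dense_times) + intercept
            if t_next - t_pred > theta_t:
                boundaries.append(t_next)
        dense_times.append(t_next)
    return boundaries


# Four cells become dense at t = 2, 4, 6, 8; a fifth only at t = 30.
demo = [
    (1, (0.2, 0.3)), (2, (0.7, 0.1)),
    (3, (1.2, 0.3)), (4, (1.7, 0.1)),
    (5, (2.2, 0.3)), (6, (2.7, 0.1)),
    (7, (3.2, 0.3)), (8, (3.7, 0.1)),
    (29, (10.2, 10.3)), (30, (10.7, 10.1)),
]
print(detect_boundaries(demo, cell_size=1.0, theta_n=1, theta_t=5))  # [30]
```

In the demo, the regression over the first four timestamps predicts the next dense cell at t = 10, so the cell that actually forms at t = 30 exceeds the tolerance of 5 and is flagged. In practice cell_size, theta_n and theta_t would come from the static training data mentioned above rather than being set by hand.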
