Capturing Concepts and Detecting Concept-Drift from Potential Unbounded, Ever-Evolving and High-Dimensional Data Streams - Data Mining: Foundations and Practice

Databases Reference

In-Depth Information

comprehensive knowledge discovery architecture should be built solely upon

one-pass scan of potentially unbounded, ever evolving and high dimensional

data stream. In order to tackle this challenge, in this paper, we (1) proposes

a unique cube structure, where the time dimension is dynamically segmented

by concept drifts into concept cycles, the sub-cube at each concept cycle only

consists of cells that are active at that cycle, and the facts of each cell are

compressed time series of statistics summaries of the events occurred in that

cell over a concept cycle; (2) explores an e cient in-memory data structure to

organize the sub-cube for the current concept cycle, capture and compress sta-

tistics of events occurred in each active cell, and automatically detect concept

drifts; (3) investigates effectively compressed and e ciently indexed storage

structures to warehouse historical concepts; (4) develops online algorithms to

query and visualize the current or a historical concept from different perspec-

tives solely based on synopses stored in the disk. The rest of this paper is

organized as follows. A brief review of related works is presented in Sect. 2. In

Sect. 3, we define concept and concept drift, and describe the data structure

used to compress and store concepts. Then, in Sect. 4, we present the data

structures and algorithms for real-time capturing concept and detecting con-

cept drift. In Sect. 5, we discuses the learning algorithm for pre-determining a

couple of parameters. The online algorithms to query and visualize the current

or a historical concept from different perspectives are presented in Sect. 6. In

Sect. 7, preliminary experimental results are presented. Finally, we will reach

our conclusion in Sect. 8.

2 Related Work

A large number of works focus on building adaptive classification models to

deal with concept drifts occurred in data stream [1-8]. Two major types of

methodologies are adopted to serve this purpose. The first type is to build in-

cremental classification model that continuously incorporating new data and

fading the effects of old examples at certain rates [2,4,7,8]. The other type of

methods is to take advantage of classifier ensembles [3,5,6], where the weights

of classifiers are continuously modified in order to adapt to concept drift. Ob-

viously, this group of works process data streams that are mixed with labeled

and unlabeled data. But in many cases, class labels are not available in data

streams. Another problem is that these works can not detect evolving pattern

itself. The unsupervised learning (clustering algorithms) are thought to be a

challenging problem to data streams [9]. Some one-pass versions of clustering

algorithms have been proposed to tackle the scalability problem brought by

unbounded data streams [10-12]. Arguing that one-pass clustering algorithms

over an entire data stream suffer from the heavy effects of the outdated his-

torical data, Aggarwal et al. proposed an interesting method to explore clus-

tering patterns over different time windows [9]. This work stores statistical

snapshots of the data stream at different level of granularity depending upon

Search WWH ::

Custom Search

Home