Databases Reference
In-Depth Information
comprehensive knowledge discovery architecture should be built solely upon
one-pass scan of potentially unbounded, ever evolving and high dimensional
data stream. In order to tackle this challenge, in this paper, we (1) proposes
a unique cube structure, where the time dimension is dynamically segmented
by concept drifts into concept cycles, the sub-cube at each concept cycle only
consists of cells that are active at that cycle, and the facts of each cell are
compressed time series of statistics summaries of the events occurred in that
cell over a concept cycle; (2) explores an e cient in-memory data structure to
organize the sub-cube for the current concept cycle, capture and compress sta-
tistics of events occurred in each active cell, and automatically detect concept
drifts; (3) investigates effectively compressed and e ciently indexed storage
structures to warehouse historical concepts; (4) develops online algorithms to
query and visualize the current or a historical concept from different perspec-
tives solely based on synopses stored in the disk. The rest of this paper is
organized as follows. A brief review of related works is presented in Sect. 2. In
Sect. 3, we define concept and concept drift, and describe the data structure
used to compress and store concepts. Then, in Sect. 4, we present the data
structures and algorithms for real-time capturing concept and detecting con-
cept drift. In Sect. 5, we discuses the learning algorithm for pre-determining a
couple of parameters. The online algorithms to query and visualize the current
or a historical concept from different perspectives are presented in Sect. 6. In
Sect. 7, preliminary experimental results are presented. Finally, we will reach
our conclusion in Sect. 8.
2 Related Work
A large number of works focus on building adaptive classification models to
deal with concept drifts occurred in data stream [1-8]. Two major types of
methodologies are adopted to serve this purpose. The first type is to build in-
cremental classification model that continuously incorporating new data and
fading the effects of old examples at certain rates [2,4,7,8]. The other type of
methods is to take advantage of classifier ensembles [3,5,6], where the weights
of classifiers are continuously modified in order to adapt to concept drift. Ob-
viously, this group of works process data streams that are mixed with labeled
and unlabeled data. But in many cases, class labels are not available in data
streams. Another problem is that these works can not detect evolving pattern
itself. The unsupervised learning (clustering algorithms) are thought to be a
challenging problem to data streams [9]. Some one-pass versions of clustering
algorithms have been proposed to tackle the scalability problem brought by
unbounded data streams [10-12]. Arguing that one-pass clustering algorithms
over an entire data stream suffer from the heavy effects of the outdated his-
torical data, Aggarwal et al. proposed an interesting method to explore clus-
tering patterns over different time windows [9]. This work stores statistical
snapshots of the data stream at different level of granularity depending upon
Search WWH ::




Custom Search