Information Technology Reference
In-Depth Information
7.1
INTRODUCTION
Learning from data streams has been featured in many practical applications
such as network traffic monitoring and credit fraud identification [1]. Generally
speaking, a data stream is a sequence of unbounded, real-time data items with a
very high rate that can be read only once by an application [2]. The restriction at
the end of this definition is also called a one-pass constraint [3], which is also
confirmed by other studies [4-6]. The study of learning from data streams has
been a quite popular topic in the machine learning community. Domingos and
Hulten [7] proposed the very fast decision tree (VFDT) to address data mining
from high-speed data streams such as web access data. By using Hoeffding
bounds, it can offer an approximately identical performance as a conventional
learner on a static dataset. Learn++ [8] approaches learning from data streams
through an aggressive ensemble-of-ensemble learning paradigm. Briefly
speaking, Learn++ processes the data streams in units of data chunks. For each
data chunk, Learn++ applies the base learner of multilayer perceptron (MLP)
to create multiple ensemble hypotheses on it. He and Chen [9] proposed the
incremental multiple-object recognition and localization (IMORL) framework
to address learning from video data streams. It calculates the Euclidean distance
in feature space between examples within consecutive data chunks to transmit
sampling weights in a biased manner, that is, hard-to-learn examples would
gradually be assigned higher weights for learning, which resembles AdaBoost's
weight-updating mechanism [10] to some degree. In Literature [11], an approach
to real-time generation of fuzzy rule-based systems of eXtended Takagi-Sugeno
(xTS) type from data streams was proposed, which applies an incremental clus-
tering procedure to generate clusters to form fuzzy rule-based systems. Georgieva
and Filev [12] proposed the Gustafson-Kessel algorithm for incremental clus-
tering of data streams. It applies the adaptive-distance metric to identify clusters
with different shapes and orientations. As a follow-up, Filev and Georgieva [13]
extended the Gustafson-Kessel algorithm to enable real-time clustering of data
streams.
Inability to store all data into memory for learning as done by traditional
approaches has not been the sole challenge presented by data streams. As the
term concept drift suggests, it is undesirable, yet inevitable, that class concepts
evolve as data streams move forward. This property, combined with a virtually
unbounded volume of data streams, accounts for the so-called stability-plasticity
dilemma [14]. One may be trapped in an endless loop of pondering, either reserv-
ing just the most recent knowledge to battle against concept drift or keeping track
of knowledge as much as possible in avoidance of “catastrophic forgetting” [14].
Many studies have been recorded to strike a balance on this dilemma. Marked
by an effort of adapting an ensemble approach to time-evolving data streams,
streaming ensemble algorithm (SEA) [15] maintains an ensemble pool of C4.5
hypotheses with a fixed size, each of which is built on a data chunk with a unique
timestamp. When a request to insert a new hypothesis is made but the ensemble
pool is fully occupied, some criterion is introduced to evaluate whether the new
Search WWH ::




Custom Search