Given Definition 1, the notion of concept drift can be easily defined. As reported in
[23], a data stream can be divided into batches, namely b_1, b_2, ..., b_n. For each batch
b_i, data are independently distributed according to a distribution P_i(). Depending on the
amount and type of concept drift, P_i() will differ from P_{i+1}(). A typical example is
customers' buying preferences, which may change according to the day of the week, the
inflation rate and/or the availability of alternatives. Two main types of concept drift are
usually distinguished in the literature, i.e. abrupt and gradual. Abrupt changes imply a
radical variation of the data distribution from a given point in time onward, while gradual
changes are characterized by a steady variation over a period of time. The concept drift
phenomenon directly entails data expiration, forcing stream mining systems to be
continuously updated to keep track of changes. This implies making time-critical decisions
for huge volumes of high-speed streaming data.
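To make the two drift types concrete, the following minimal Python sketch (not taken from
this work; batch size, drift point and feature distributions are arbitrary assumptions)
generates batches b_1, ..., b_n whose underlying distribution P_i() either jumps at a fixed
batch or shifts slightly with every batch:

# Illustrative sketch: synthetic batches with abrupt vs. gradual drift.
# All numeric choices below are assumptions made only for demonstration.
import numpy as np

rng = np.random.default_rng(0)

def make_batches(n_batches=10, batch_size=500, drift="abrupt"):
    """Return a list of batches; batch i is drawn i.i.d. from P_i()."""
    batches = []
    for i in range(n_batches):
        if drift == "abrupt":
            # P_i() jumps at batch 5: the mean shifts from 0.0 to 3.0
            mean = 0.0 if i < 5 else 3.0
        else:  # gradual
            # P_i() moves a little with every batch
            mean = 3.0 * i / (n_batches - 1)
        batches.append(rng.normal(loc=mean, scale=1.0, size=batch_size))
    return batches

for kind in ("abrupt", "gradual"):
    means = [b.mean() for b in make_batches(drift=kind)]
    print(kind, [f"{m:.2f}" for m in means])

Printing the per-batch means shows a single jump in the abrupt case and a smooth
progression in the gradual one.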
2.1 Requirements
As introduced in Section 2, the features of a stream radically influence the design of a
data stream classifier. A set of requirements must therefore be taken into account before
proposing a new approach; these requirements motivate several of the implementation
decisions embedded in our approach.
Since data streams can be potentially unbounded in size, and data arrive at unpredictable
rates, rigid constraints are imposed on the time and memory a system may use over time
(a minimal sketch illustrating how such constraints can be met follows the list of
requirements):
Req. 1: the time required to process every single stream element must be constant,
which implies that each data sample can be analyzed essentially only once.
Req. 2: the memory needed to store all the statistics required by the system must be
constant over time and must not depend on the number of elements analyzed.
Req. 3: the system must be able to update its structures readily, working within
a limited time span and guaranteeing an acceptable level of reliability.
Given Definition 1, the elements to classify can arrive at any moment during the data
flow.
Req. 4: the system must be able to classify unseen elements at any time during its
computation.
Req. 5: the system should be able to manage a set of models that are not necessarily
contiguous, i.e. classifiers extracted from adjacent portions of the stream.
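As a concrete illustration of Reqs. 1, 2 and 4 (constant-time updates, constant memory,
anytime prediction), the sketch below implements a single-pass Gaussian naive Bayes over
one numeric feature. It is only a simplified, assumed example, not the system proposed in
this work:

# Illustrative sketch: an incremental classifier whose per-class statistics
# occupy constant memory (Req. 2), are updated in O(1) per sample (Req. 1),
# and can answer prediction queries at any point of the stream (Req. 4).
import math
from collections import defaultdict

class OnlineGaussianNB:
    def __init__(self):
        # per class: [count, running mean, running M2] (Welford's algorithm)
        self.stats = defaultdict(lambda: [0, 0.0, 0.0])

    def learn_one(self, x, y):
        n, mean, m2 = self.stats[y]
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        self.stats[y] = [n, mean, m2]

    def predict_one(self, x):
        best, best_score = None, -math.inf
        total = sum(s[0] for s in self.stats.values())
        for y, (n, mean, m2) in self.stats.items():
            var = max(m2 / n if n > 1 else 1.0, 1e-9)
            # log prior + log Gaussian likelihood
            score = math.log(n / total) - 0.5 * (
                math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
            if score > best_score:
                best, best_score = y, score
        return best

# Usage: learning and prediction are interleaved, as Req. 4 demands.
clf = OnlineGaussianNB()
stream = [(0.1, "a"), (0.2, "a"), (3.1, "b"), (2.9, "b"), (0.15, "a")]
for x, y in stream:
    print(x, "->", clf.predict_one(x))   # anytime prediction
    clf.learn_one(x, y)                  # constant-time update

The per-class statistics (count, running mean, running variance) occupy a fixed amount of
memory regardless of how many elements have been processed, so the classifier never needs
to revisit past data.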
2.2 Related Work
Mining data streams has rapidly become an important and challenging research field.
As proposed in [12], the available solutions can be classified into data-based and
task-based ones. In the former approaches a data stream is transformed into an
approximate, smaller-size representation, while task-based techniques employ methods
from computational theory to achieve time- and space-efficient solutions. Aggregation
[1,2,3], sampling [10] and summarized data structures, such as histograms [21,17], are
popular data-based techniques.
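As an example of a data-based technique, classical reservoir sampling keeps a fixed-size
uniform sample of an unbounded stream, so its memory footprint never grows with the number
of elements seen. The sketch below is a generic textbook version (reservoir size and seed
are arbitrary assumptions), not necessarily the variant described in [10]:

# Illustrative sketch: reservoir sampling over an unbounded stream.
import random

def reservoir_sample(stream, k=100, seed=42):
    """Maintain a uniform random sample of size k over the stream."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)           # item kept with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=10))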