Given Definition 1, the notion of concept drift can be easily defined. As reported in
[23], a data stream can be divided into batches, namely b_1, b_2, ..., b_n. For each batch
b_i, data are independently distributed according to a distribution P_i(). Depending on the
amount and type of concept drift, P_i() will differ from P_{i+1}(). A typical example is
customers' buying preferences, which may change according to the day of the week, the
inflation rate and/or the availability of alternatives. Two main types of concept drift are
usually distinguished in the literature, i.e. abrupt and gradual. Abrupt changes imply a
radical variation of the data distribution from a given point in time onward, while gradual
changes are characterized by a steady variation over a period of time. The concept drift
phenomenon directly entails data expiration, forcing stream mining systems to be
continuously updated to keep track of changes. This implies making time-critical decisions
for huge volumes of high-speed streaming data.
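To make the two drift types concrete, the following minimal Python sketch (not taken from
this work; batch size, drift point and feature distributions are arbitrary assumptions)
generates batches b_1, ..., b_n whose underlying distribution P_i() either jumps at a fixed
batch or shifts slightly with every batch:

# Illustrative sketch: synthetic batches with abrupt vs. gradual drift.
# All numeric choices below are assumptions made only for demonstration.
import numpy as np

rng = np.random.default_rng(0)

def make_batches(n_batches=10, batch_size=500, drift="abrupt"):
    """Return a list of batches; batch i is drawn i.i.d. from P_i()."""
    batches = []
    for i in range(n_batches):
        if drift == "abrupt":
            # P_i() jumps at batch 5: the mean shifts from 0.0 to 3.0
            mean = 0.0 if i < 5 else 3.0
        else:  # gradual
            # P_i() moves a little with every batch
            mean = 3.0 * i / (n_batches - 1)
        batches.append(rng.normal(loc=mean, scale=1.0, size=batch_size))
    return batches

for kind in ("abrupt", "gradual"):
    means = [b.mean() for b in make_batches(drift=kind)]
    print(kind, [f"{m:.2f}" for m in means])

Printing the per-batch means shows a single jump in the abrupt case and a smooth
progression in the gradual one.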
2.1 Requirements
As introduced in Section 2, the features of a stream radically influence the design of a
data stream classifier. A set of requirements must therefore be taken into account before
proposing a new approach; these requirements motivate several of the implementation
decisions embedded in our approach.
Since data streams can be potentially unbounded in size, and data arrive at unpredictable
rates, rigid constraints are imposed on the time and memory a system may use over time
(a minimal sketch illustrating how such constraints can be met follows the list of
requirements):
Req. 1: the time required to process every single stream element must be constant,
which implies that each data sample can be analyzed essentially only once.
Req. 2: the memory needed to store all the statistics required by the system must be
constant over time and must not depend on the number of elements analyzed.
Req. 3: the system must be able to update its structures readily, working within
a limited time span and guaranteeing an acceptable level of reliability.
Given Definition 1, the elements to classify can arrive at any moment during the data
flow.
Req. 4: the system must be able to classify unseen elements at any time during its
computation.
Req. 5: the system should be able to manage a set of models that are not necessarily
contiguous, i.e. classifiers extracted from adjacent portions of the stream.
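As a concrete illustration of Reqs. 1, 2 and 4 (constant-time updates, constant memory,
anytime prediction), the sketch below implements a single-pass Gaussian naive Bayes over
one numeric feature. It is only a simplified, assumed example, not the system proposed in
this work:

# Illustrative sketch: an incremental classifier whose per-class statistics
# occupy constant memory (Req. 2), are updated in O(1) per sample (Req. 1),
# and can answer prediction queries at any point of the stream (Req. 4).
import math
from collections import defaultdict

class OnlineGaussianNB:
    def __init__(self):
        # per class: [count, running mean, running M2] (Welford's algorithm)
        self.stats = defaultdict(lambda: [0, 0.0, 0.0])

    def learn_one(self, x, y):
        n, mean, m2 = self.stats[y]
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        self.stats[y] = [n, mean, m2]

    def predict_one(self, x):
        best, best_score = None, -math.inf
        total = sum(s[0] for s in self.stats.values())
        for y, (n, mean, m2) in self.stats.items():
            var = max(m2 / n if n > 1 else 1.0, 1e-9)
            # log prior + log Gaussian likelihood
            score = math.log(n / total) - 0.5 * (
                math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
            if score > best_score:
                best, best_score = y, score
        return best

# Usage: learning and prediction are interleaved, as Req. 4 demands.
clf = OnlineGaussianNB()
stream = [(0.1, "a"), (0.2, "a"), (3.1, "b"), (2.9, "b"), (0.15, "a")]
for x, y in stream:
    print(x, "->", clf.predict_one(x))   # anytime prediction
    clf.learn_one(x, y)                  # constant-time update

The per-class statistics (count, running mean, running variance) occupy a fixed amount of
memory regardless of how many elements have been processed, so the classifier never needs
to revisit past data.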
2.2 Related Work
Mining data streams has rapidly become an important and challenging research field.
As proposed in [12], the available solutions can be classified into data-based and
task-based ones. In the former approaches a data stream is transformed into an
approximate, smaller-size representation, while task-based techniques employ methods
from computational theory to achieve time- and space-efficient solutions. Aggregation
[1,2,3], sampling [10] and summarized data structures, such as histograms [21,17], are
popular data-based techniques.
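As an example of a data-based technique, classical reservoir sampling keeps a fixed-size
uniform sample of an unbounded stream, so its memory footprint never grows with the number
of elements seen. The sketch below is a generic textbook version (reservoir size and seed
are arbitrary assumptions), not necessarily the variant described in [10]:

# Illustrative sketch: reservoir sampling over an unbounded stream.
import random

def reservoir_sample(stream, k=100, seed=42):
    """Maintain a uniform random sample of size k over the stream."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)           # item kept with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=10))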