Data Streams Classification: A Selective Ensemble with Adaptive Behavior - Agents and Artificial Intelligence


through time. In our former approach, the models involved in the classification step were selected using a fixed activation threshold. This choice is appropriate if the best value for the threshold can be determined a priori. In many real environments, this information is unavailable, since the behavior of the data stream cannot be modeled. In several domains, such as intrusion detection, the data distribution can remain stable for a long time and then change radically when an attack occurs.
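To make the idea of a fixed activation threshold concrete, the following is a minimal sketch and not the system of [20,19]. It assumes that each model in the ensemble carries a weight in [0, 1] reflecting its reliability and that only models whose weight reaches the threshold take part in a weighted majority vote. The names `select_active` and `weighted_vote`, the weights, and the voting rule are all hypothetical.

```python
from collections import defaultdict

def select_active(weights, threshold):
    """Return the indices of the models whose weight reaches the threshold."""
    return [i for i, w in enumerate(weights) if w >= threshold]

def weighted_vote(predictions, weights, threshold):
    """Weighted majority vote among the active models (assumed rule).

    Returns None when no model reaches the activation threshold.
    """
    active = select_active(weights, threshold)
    if not active:
        return None
    tally = defaultdict(float)
    for i in active:
        tally[predictions[i]] += weights[i]
    return max(tally, key=tally.get)

# Hypothetical ensemble of four models and their predictions for one record.
weights = [0.9, 0.4, 0.7, 0.2]
predictions = [1, 2, 2, 2]

print(weighted_vote(predictions, weights, threshold=0.3))  # models 0, 1, 2 vote
print(weighted_vote(predictions, weights, threshold=0.5))  # models 0, 2 vote
```

With these illustrative numbers, the predicted class changes when the threshold moves from 0.3 to 0.5, which is why a value fixed in advance can be a poor fit once the stream changes.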

This work presents an evolution of the system outlined in [20,19]. The new approach introduces a fully adaptive behavior in the management of the threshold required for selecting the set of models actually involved in the classification. This work describes the adaptive approach for varying the value of the model activation threshold through time, which influences the overall behavior of the ensemble classifier in reaction to changes in the data. Our approach is explained using binary attributes. This choice can be seen as a limitation, but it is worth observing that every nominal attribute can be easily transformed into a set of binary ones. The only real limitation is that numerical values cannot be treated directly. A general approach to the on-line discretization of numerical attributes is presented in [14]. That method is particularly suitable in our context, since it discretizes in two layers: the first layer summarizes the data, while the second one constructs the final binning. The process of updating the first layer works on-line and requires a single scan over the data.
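The two preprocessing steps mentioned above can be sketched as follows. This is only an illustration under stated assumptions: `binarize` assumes that the domain of each nominal attribute is known in advance, and `TwoLayerDiscretizer` is not the algorithm of [14] but a simplified two-layer scheme in which the first layer keeps counts over fine equal-width intervals (updated in a single pass) and the second layer derives equal-frequency cut points from those counts. All names, the value range, and the choice of equal-width and equal-frequency binning are hypothetical.

```python
def binarize(record, domains):
    """Expand each nominal attribute into one binary indicator per value.

    record:  dict mapping attribute name -> nominal value
    domains: dict mapping attribute name -> list of admissible values
             (assumed known; an unseen value yields all zeros)
    """
    bits = []
    for attr, values in domains.items():
        bits.extend(1 if record[attr] == v else 0 for v in values)
    return bits


class TwoLayerDiscretizer:
    """Simplified two-layer discretizer (an assumption, not the method of [14])."""

    def __init__(self, low, high, n_fine=100):
        self.low = low
        self.width = (high - low) / n_fine
        self.counts = [0] * n_fine  # layer 1: summary of the data seen so far

    def update(self, x):
        """Layer 1: constant-time update, one scan over the stream."""
        i = int((x - self.low) / self.width)
        i = min(max(i, 0), len(self.counts) - 1)  # clamp out-of-range values
        self.counts[i] += 1

    def cut_points(self, k):
        """Layer 2: build k roughly equal-frequency bins from the summary."""
        total = sum(self.counts)
        if total == 0:
            return []
        cuts, seen = [], 0
        for i, c in enumerate(self.counts):
            seen += c
            if len(cuts) < k - 1 and seen >= total * (len(cuts) + 1) / k:
                cuts.append(self.low + (i + 1) * self.width)
        return cuts


domains = {"protocol": ["tcp", "udp", "icmp"], "flag": ["SF", "REJ"]}
print(binarize({"protocol": "udp", "flag": "SF"}, domains))  # [0, 1, 0, 1, 0]

disc = TwoLayerDiscretizer(low=0.0, high=100.0, n_fine=100)
for v in range(100):
    disc.update(float(v))
print(disc.cut_points(4))  # [25.0, 50.0, 75.0]
```

Once cut points are available, a numerical attribute becomes a nominal one (the bin index), which can then be passed through `binarize` like any other nominal attribute.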

Paper Organization: Section 2 introduces our reference scenario, outlining some requirements that a system working in streaming environments should satisfy. Section 3 describes our approach in detail, highlighting how the requirements introduced in Section 2.1 are satisfied by the proposed model. Furthermore, it presents how our adaptive selection is implemented. Section 4 presents a comparative study to understand how the new adaptive approach guarantees a higher reliability of the system. In this section, our approach is compared with other well-known approaches available in the literature. Finally, Section 5 draws the conclusions and introduces some future work.

2 Data Streams Classification

Data streams represent a new challenge for the data mining community. In a stream scenario, traditional mining methods are further constrained by the unpredictable behavior of a large volume of data. The data arrive on-line at variable rates, and once an element has been processed, it must be discarded or archived. In either case, it cannot be easily retrieved. Mining systems have no control over data generation, and they must be capable of guaranteeing a near real-time response.

Definition 1. A data stream is an infinite set of elements $X = X_1, \ldots, X_j, \ldots$ where each $X_i \in X$ has $a + 1$ dimensions, $(x_i^1, \ldots, x_i^a, y)$, and where $y \in \{\bot, 1, \ldots, C\}$, and $1, \ldots, C$ identify the possible values in a class.

A stream can be divided into two sets based on the availability of the class label $y$. If the value of $y$ is available in the record ($y \neq \bot$), it belongs to the training set. Otherwise the record represents an element to classify, and the true label will only be available after an unpredictable period of time.
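As a small illustration of this split, the sketch below routes each incoming record either to training or to classification depending on whether its label is present. It assumes that a missing label (⊥) is encoded as `None` and that attributes are binary; the `Record` type and the `route` function are hypothetical names introduced only for this example.

```python
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple

class Record(NamedTuple):
    x: Tuple[int, ...]   # the a attribute values (binary here, by assumption)
    y: Optional[int]     # class label in 1..C, or None standing for ⊥

def route(stream: Iterable[Record]) -> Iterator[Tuple[str, Record]]:
    """Tag each record as training data or as an element to classify.

    The stream is consumed once, element by element, as required in a
    streaming setting; nothing is buffered.
    """
    for rec in stream:
        yield ("train" if rec.y is not None else "classify"), rec

stream = [
    Record((1, 0, 1), 2),     # labeled: belongs to the training set
    Record((0, 0, 1), None),  # unlabeled: must be classified
    Record((1, 1, 0), 1),
]
for target, rec in route(stream):
    print(target, rec.x, rec.y)
```

Because `route` is a generator, it works equally well on an unbounded source, which matches the definition of a stream as an infinite set of elements.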
