Table 1. Description of data sets

dataSet          #inst                #attrs  #irrAttrs  #classes  %noise  #drifts
LED24 a/b/test   10k / 100k / 25k     24      17         10        10%     none
Hyper a/b/test   10k / 100k / 25k     15      0          2         0       none
Stagger a/test   1200 / 120k          9       0          2         0       3 (every 400) / (every 40k)
Stagger b/test   12k / 1200k          9       0          2         0       3 (every 4k) / (every 400k)
cHyper a/b/c     10k / 100k / 1000k   15      0          2         10%     20 (every 500) / (every 5k) / (every 50k)
cHyper test      250k                 15      0          2         10%     20 (every 12.5k)
Cyclic           600k                 25      0          2         5%      15 × 4 (every 10k)
Cyclic test      150k                 25      0          2         5%      15 × 4 (every 2.5k)
introduces noise by randomly flipping the label of a tuple with a given probability.
Two additional data sets, namely Hyper and Cyclic, are generated using this approach. Hyper does not include any drift, while Cyclic poses the problem of periodically recurring concepts.
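As a concrete illustration of the noise-injection scheme described above, the sketch below flips each tuple's label to a uniformly chosen different class with a given probability. The class and method names are ours for illustration, not those of the generators' actual implementation.

```java
import java.util.Random;

// Sketch of label-flipping noise: with probability `noise`, replace the
// label with a uniformly chosen *different* class; otherwise keep it.
public class LabelNoise {

    static int flipLabel(int label, int numClasses, double noise, Random rnd) {
        if (rnd.nextDouble() < noise) {
            // An offset in [1, numClasses - 1] guarantees the new label differs.
            int offset = 1 + rnd.nextInt(numClasses - 1);
            return (label + offset) % numClasses;
        }
        return label;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int flipped = 0, n = 100000;
        for (int i = 0; i < n; i++) {
            if (flipLabel(0, 2, 0.10, rnd) != 0) flipped++;
        }
        // With 10% noise, roughly one label in ten changes class.
        System.out.println("flipped fraction = " + (double) flipped / n);
    }
}
```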
KddCup99: this real data set concerns the significant problem of automatic, real-time detection of cyber attacks [31]. The data consists of a series of network connections collected over two weeks of LAN traffic. Each record corresponds either to a normal connection or to an intrusion. Each connection is represented by 42 attributes (34 numerical), such as the duration of the connection, the number of bytes transmitted, and the type of protocol used, e.g. tcp or udp. The data contains 23 training attack types, which can be further aggregated into four categories, namely DOS, R2L, U2R, and Probing. Due to its unstable nature, KddCup99 is widely employed to evaluate data stream classification systems, including [3,16].
The features of the data sets actually employed are reported in Table 1. The stable LED24 and Hyper are useful for testing whether the mechanism for reacting to change affects the reliability of the systems. The evolving data sets test different features of a stream classification system. The Stagger problem verifies whether all the systems can cope with concept drift on a low-dimensional problem. The problem of learning in the presence of concept drift is then evaluated with the other data sets, also considering a huge quantity of data with cHyper.
4.2
Systems
Several popular stream ensemble methods are included in our experiments. All the systems expect the data stream to be divided into chunks of a well-defined size. All the approaches are implemented in Java 1.6, using the MOA [32] and WEKA [33] libraries for the implementation of the base learners, and employ complete, non-approximate data for the mining task.
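The chunk-based processing model shared by all the compared systems can be sketched as follows. This is an illustrative in-memory version under our own naming; the actual implementations read chunks incrementally from MOA streams.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of chunk-based stream processing: the stream is consumed in
// blocks of a well-defined size, and each complete chunk is handed to
// the ensemble for learning and evaluation.
public class Chunker {

    // Split an in-memory "stream" into consecutive chunks of chunkSize
    // instances; the last chunk may be smaller than chunkSize.
    static <T> List<List<T>> toChunks(List<T> stream, int chunkSize) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < stream.size(); i += chunkSize) {
            int end = Math.min(i + chunkSize, stream.size());
            chunks.add(new ArrayList<>(stream.subList(i, end)));
        }
        return chunks;
    }
}
```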
Fix: this approach is the simplest one. It maintains a fixed-size set of classifiers, managed as a FIFO queue. Every new classifier is unconditionally inserted into the ensemble, removing the oldest one when the ensemble is full.
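A minimal sketch of this FIFO eviction policy is given below; the type parameter C stands in for a MOA/WEKA base learner, and the class and method names are ours.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the Fix strategy: a bounded ensemble managed as a FIFO queue.
// Every new classifier is inserted unconditionally; when the ensemble is
// full, the oldest member is evicted first.
public class FixEnsemble<C> {
    private final int capacity;
    private final Deque<C> members = new ArrayDeque<>();

    public FixEnsemble(int capacity) {
        this.capacity = capacity;
    }

    // Add the classifier trained on the latest chunk, evicting the
    // oldest model if the ensemble has reached capacity.
    public void add(C classifier) {
        if (members.size() == capacity) {
            members.removeFirst(); // drop the oldest classifier
        }
        members.addLast(classifier);
    }

    public int size() { return members.size(); }

    public C oldest() { return members.peekFirst(); }
}
```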