Table 1. Description of data sets

dataSet          #inst                #attrs  #irrAttrs  #classes  %noise  #drifts
LED24 a/b/test   10k / 100k / 25k     24      17         10        10%     none
Hyper a/b/test   10k / 100k / 25k     15      0          2         0       none
Stagger a/test   1200 / 120k          9       0          2         0       3 (every 400) / (every 40k)
Stagger b/test   12k / 1200k          9       0          2         0       3 (every 4k) / (every 400k)
cHyper a/b/c     10k / 100k / 1000k   15      0          2         10%     20 (every 500) / (every 5k) / (every 50k)
cHyper test      250k                 15      0          2         10%     20 (every 12.5k)
Cyclic           600k                 25      0          2         5%      15 × 4 (every 10k)
Cyclic test      150k                 25      0          2         5%      15 × 4 (every 2.5k)
introduces noise by randomly flipping the label of a tuple with a given probability.
Two additional data sets, namely Hyper and Cyclic, are generated using this approach. Hyper does not include any drift, while Cyclic poses the problem of periodically recurring concepts.
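As a concrete illustration of the noise-injection scheme described above, the sketch below flips each tuple's label to a uniformly chosen different class with a given probability. The class and method names are ours for illustration, not those of the generators' actual implementation.

```java
import java.util.Random;

// Sketch of label-flipping noise: with probability `noise`, replace the
// label with a uniformly chosen *different* class; otherwise keep it.
public class LabelNoise {

    static int flipLabel(int label, int numClasses, double noise, Random rnd) {
        if (rnd.nextDouble() < noise) {
            // An offset in [1, numClasses - 1] guarantees the new label differs.
            int offset = 1 + rnd.nextInt(numClasses - 1);
            return (label + offset) % numClasses;
        }
        return label;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int flipped = 0, n = 100000;
        for (int i = 0; i < n; i++) {
            if (flipLabel(0, 2, 0.10, rnd) != 0) flipped++;
        }
        // With 10% noise, roughly one label in ten changes class.
        System.out.println("flipped fraction = " + (double) flipped / n);
    }
}
```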
KddCup99: this real data set concerns the significant problem of automatic, real-time detection of cyber attacks [31]. The data consists of a series of network connections collected over two weeks of LAN traffic. Each record corresponds either to a normal connection or to an intrusion. Each connection is represented by 42 attributes (34 numerical), such as the duration of the connection, the number of bytes transmitted, and the type of protocol used, e.g. tcp or udp. The data contains 23 training attack types, which can be further aggregated into four categories, namely DOS, R2L, U2R, and Probing. Due to its unstable nature, KddCup99 is widely employed to evaluate data stream classification systems, including [3,16].
The features of the data sets actually employed are reported in Table 1. The stable LED24 and Hyper are useful for testing whether the mechanism for reacting to change affects the reliability of the systems. The evolving data sets test different features of a stream classification system. The Stagger problem verifies whether all the systems can cope with concept drift on a low-dimensional problem. The problem of learning in the presence of concept drift is then evaluated with the other data sets, also considering a huge quantity of data with cHyper.
4.2
Systems
Several popular stream ensemble methods are included in our experiments. All the systems expect the data stream to be divided into chunks of a well-defined size. All the approaches are implemented in Java 1.6, using the MOA [32] and WEKA [33] libraries for the implementation of the base learners, and employ complete, non-approximate data for the mining task.
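The chunk-based processing model shared by all the compared systems can be sketched as follows. This is an illustrative in-memory version under our own naming; the actual implementations read chunks incrementally from MOA streams.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of chunk-based stream processing: the stream is consumed in
// blocks of a well-defined size, and each complete chunk is handed to
// the ensemble for learning and evaluation.
public class Chunker {

    // Split an in-memory "stream" into consecutive chunks of chunkSize
    // instances; the last chunk may be smaller than chunkSize.
    static <T> List<List<T>> toChunks(List<T> stream, int chunkSize) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < stream.size(); i += chunkSize) {
            int end = Math.min(i + chunkSize, stream.size());
            chunks.add(new ArrayList<>(stream.subList(i, end)));
        }
        return chunks;
    }
}
```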
Fix: this approach is the simplest one. It maintains a fixed-size set of classifiers, managed as a FIFO queue. Every new classifier is unconditionally inserted into the ensemble, removing the oldest one when the ensemble is full.
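A minimal sketch of this FIFO eviction policy is given below; the type parameter C stands in for a MOA/WEKA base learner, and the class and method names are ours.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the Fix strategy: a bounded ensemble managed as a FIFO queue.
// Every new classifier is inserted unconditionally; when the ensemble is
// full, the oldest member is evicted first.
public class FixEnsemble<C> {
    private final int capacity;
    private final Deque<C> members = new ArrayDeque<>();

    public FixEnsemble(int capacity) {
        this.capacity = capacity;
    }

    // Add the classifier trained on the latest chunk, evicting the
    // oldest model if the ensemble has reached capacity.
    public void add(C classifier) {
        if (members.size() == capacity) {
            members.removeFirst(); // drop the oldest classifier
        }
        members.addLast(classifier);
    }

    public int size() { return members.size(); }

    public C oldest() { return members.peekFirst(); }
}
```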