Table 1. Description of data sets

dataSet             #inst               #attrs  #irrAttrs  #classes  %noise  #drifts
LED24 a / b / test  10k / 100k / 25k    24      17         10        10%     none
Hyper a / b / test  10k / 100k / 25k    15      0          2         0       none
Stagger a / test    1200 / 120k         9       0          2         0       3 (every 400) / (every 40k)
Stagger b / test    12k / 1200k         9       0          2         0       3 (every 4k) / (every 400k)
cHyper a / b / c    10k / 100k / 1000k  15      0          2         10%     20 (every 500) / (every 5k) / (every 50k)
cHyper test         250k                15      0          2         10%     20 (every 12.5k)
Cyclic              600k                25      0          2         5%      15 × 4 (every 10k)
Cyclic test         150k                25      0          2         5%      15 × 4 (every 2.5k)
introduces noise by randomly flipping the label of a tuple with a given probability. Two additional data sets, namely Hyper and Cyclic, are generated using this approach. Hyper does not contain any drift, while Cyclic poses the problem of periodically recurring concepts.
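As an illustration, the following is a minimal sketch of such a label-flipping noise scheme; the class and method names are assumptions for the example and do not reproduce the generators actually used here.

import java.util.Random;

/** Sketch: inject class noise by flipping the label of each generated
 *  tuple with a fixed probability (illustrative interface only). */
public class LabelFlippingNoise {

    private final Random rnd = new Random(42);

    /** Returns a possibly corrupted label for one tuple. */
    public int maybeFlip(int trueLabel, int numClasses, double noiseProb) {
        if (rnd.nextDouble() >= noiseProb) {
            return trueLabel;                    // keep the correct label
        }
        // otherwise pick a different label uniformly at random
        int flipped = rnd.nextInt(numClasses - 1);
        return (flipped >= trueLabel) ? flipped + 1 : flipped;
    }
}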
KddCup99: this real data set concerns the significant problem of automatic and real-time detection of cyber attacks [31]. The data comprises a series of network connections collected from two weeks of LAN traffic. Each record corresponds to either a normal connection or an intrusive one. Each connection is described by 42 attributes (34 numerical), such as the duration of the connection, the number of bytes transmitted, and the type of protocol used, e.g. tcp or udp. The data contains 23 training attack types, which can be further aggregated into four categories, namely DOS, R2L, U2R, and Probing. Due to its unstable nature, KddCup99 is widely employed to evaluate data stream classification systems, including [3,16].
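For reference, the aggregation of attack labels into the four coarse categories can be expressed as a simple lookup, as in the sketch below; only a representative subset of the well-known label-to-category mapping is listed, and the class name is an assumption for illustration.

import java.util.HashMap;
import java.util.Map;

/** Sketch: collapse individual KddCup99 attack labels into the four
 *  categories DOS, R2L, U2R and Probing; "normal" is kept as-is.
 *  Only a subset of the 23 attack types is listed here. */
public class KddCategories {

    private static final Map<String, String> CATEGORY = new HashMap<String, String>();
    static {
        CATEGORY.put("smurf", "DOS");
        CATEGORY.put("neptune", "DOS");
        CATEGORY.put("back", "DOS");
        CATEGORY.put("guess_passwd", "R2L");
        CATEGORY.put("warezclient", "R2L");
        CATEGORY.put("buffer_overflow", "U2R");
        CATEGORY.put("rootkit", "U2R");
        CATEGORY.put("ipsweep", "Probing");
        CATEGORY.put("portsweep", "Probing");
        // ... remaining attack types omitted for brevity
    }

    public static String categoryOf(String label) {
        if ("normal".equals(label)) {
            return "normal";
        }
        String cat = CATEGORY.get(label);
        return cat != null ? cat : "unknown";
    }
}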
The features of the data sets employed are reported in Table 1. The stable LED24 and Hyper data sets are useful for testing whether the mechanism for reacting to change affects the reliability of the systems. The evolving data sets test different features of a stream classification system. The Stagger problem verifies whether all the systems can cope with concept drift, without involving high problem dimensionality. The problem of learning in the presence of concept drift is then evaluated with the other data sets, also considering a huge quantity of data with cHyper.
4.2 Systems
Several popular stream ensemble methods are included in our experiments. All the systems expect the data stream to be divided into chunks of a well-defined size. All the approaches are implemented in Java 1.6, using the MOA [32] and WEKA [33] libraries for the basic learners, and they employ complete, non-approximate data for the mining task.
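To make the chunk-based setting concrete, the following sketch shows one way to cut a stream into fixed-size chunks of WEKA Instances; the class name, reader interface and chunk size are assumptions, not the implementation used in the experiments.

import weka.core.Instance;
import weka.core.Instances;

/** Sketch: accumulate incoming tuples into fixed-size chunks and hand
 *  each complete chunk to the ensemble (illustrative only). */
public class ChunkedStreamRunner {

    private final int chunkSize;      // the "well-defined value"
    private final Instances buffer;   // chunk currently under construction

    public ChunkedStreamRunner(Instances header, int chunkSize) {
        this.chunkSize = chunkSize;
        this.buffer = new Instances(header, chunkSize);
    }

    /** Feed one tuple; returns a complete chunk when available, else null. */
    public Instances add(Instance inst) {
        buffer.add(inst);
        if (buffer.numInstances() < chunkSize) {
            return null;
        }
        Instances chunk = new Instances(buffer);  // copy the finished chunk
        buffer.delete();                          // start an empty one
        return chunk;
    }
}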
Fix: this is the simplest approach. It maintains a fixed set of classifiers, managed as a FIFO queue. Every new classifier is unconditionally inserted into the ensemble, removing the oldest one when the ensemble is full (a minimal sketch of this policy is given below).
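The sketch below illustrates the FIFO policy of Fix with WEKA base learners; the class name, the choice of J48 as base learner and the majority-vote prediction are assumptions for the example, not the paper's code.

import java.util.LinkedList;
import weka.classifiers.Classifier;
import weka.classifiers.trees.J48;
import weka.core.Instance;
import weka.core.Instances;

/** Sketch of the Fix strategy: a bounded FIFO queue of classifiers.
 *  Every chunk yields a new member; the oldest member is dropped once
 *  the ensemble is full (illustrative, not the original code). */
public class FixEnsemble {

    private final int maxSize;
    private final LinkedList<Classifier> members = new LinkedList<Classifier>();

    public FixEnsemble(int maxSize) {
        this.maxSize = maxSize;
    }

    /** Train a new classifier on the latest chunk and insert it unconditionally. */
    public void update(Instances chunk) throws Exception {
        Classifier c = new J48();        // base learner chosen for illustration
        c.buildClassifier(chunk);
        members.addLast(c);
        if (members.size() > maxSize) {
            members.removeFirst();       // discard the oldest member
        }
    }

    /** Predict by simple majority vote over the current members. */
    public double classify(Instance inst) throws Exception {
        java.util.Map<Double, Integer> votes = new java.util.HashMap<Double, Integer>();
        for (Classifier c : members) {
            double label = c.classifyInstance(inst);
            Integer n = votes.get(label);
            votes.put(label, n == null ? 1 : n + 1);
        }
        double best = -1.0;
        int bestCount = -1;
        for (java.util.Map.Entry<Double, Integer> e : votes.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}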