through time. In our former approach, the models involved in the classification step
were selected by a fixed activation threshold. This choice is appropriate when the best
value for the threshold can be studied a priori. In many real environments this
information is unavailable, because the behavior of the stream data cannot be modeled.
In several domains, such as intrusion detection, the data distribution can remain
stable for a long time and then change radically when an attack occurs.
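To make the role of the fixed activation threshold concrete, the following is a minimal sketch rather than the system described in [20,19]. It assumes that every ensemble member carries a quality score in [0, 1], that a model is active when its score reaches the threshold, and that the active models vote by simple majority. The names `Model`, `active_models`, and `classify` are hypothetical, as are the scoring scheme and the voting rule.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Model:
    """Hypothetical ensemble member; the score is an assumed quality estimate."""
    name: str
    score: float
    predict: Callable[[Sequence[int]], int]  # maps a record to a class label


def active_models(models: List[Model], threshold: float) -> List[Model]:
    """Keep only the models whose score reaches the activation threshold."""
    return [m for m in models if m.score >= threshold]


def classify(models: List[Model], threshold: float, record: Sequence[int]) -> int:
    """Majority vote among the active models (the voting rule is an assumption)."""
    voters = active_models(models, threshold)
    if not voters:
        raise ValueError("no model reaches the activation threshold")
    votes = Counter(m.predict(record) for m in voters)
    return votes.most_common(1)[0][0]


# Illustrative ensemble with constant predictors.
models = [
    Model("m1", 0.9, lambda r: 1),
    Model("m2", 0.8, lambda r: 1),
    Model("m3", 0.3, lambda r: 2),
]
label = classify(models, threshold=0.5, record=(1, 0))  # m3 is excluded from the vote
```

With a fixed threshold, the set of voters changes only when the scores change. If the threshold is set too high for the current data, `classify` has no voters at all, which is one way to see why a value chosen a priori can be fragile.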
This work presents an evolution of the system outlined in [20,19]. The new approach
introduces fully adaptive management of the threshold used to select the set of models
actually involved in the classification. We describe how the model activation threshold
is varied through time in reaction to changes in the data, which in turn influences the
overall behavior of the ensemble classifier. Our approach is explained using binary
attributes. This choice can be seen as a limitation, but it is worth observing that
every nominal attribute can easily be transformed into a set of binary ones. The only
remaining limitation is that numerical values cannot be treated directly. A general
approach to the on-line discretization of numerical attributes is given in [14]. That
method is particularly suitable in our context, since it is based on two layers: the
first layer summarizes the data, while the second one constructs the final binning.
Updating the first layer works on-line and requires a single scan over the data.
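The two preprocessing ideas in the paragraph above can be sketched as follows. `binarize` turns one nominal value into a set of binary attributes, one per value in its domain. `TwoLayerDiscretizer` illustrates the two-layer idea only in outline: the equal-width first layer, the equal-frequency second layer, and the known value range are assumptions made for this example and are not taken from [14]. All names and the sample data are hypothetical.

```python
from typing import List, Sequence


def binarize(value: str, domain: Sequence[str]) -> List[int]:
    """One binary attribute per domain value; exactly one of them is 1."""
    if value not in domain:
        raise ValueError(f"unknown value: {value!r}")
    return [1 if value == v else 0 for v in domain]


class TwoLayerDiscretizer:
    """Layer 1: fine equal-width counts updated on-line, one value at a time.
    Layer 2: merge the fine intervals into k roughly equal-frequency bins."""

    def __init__(self, lo: float, hi: float, n_fine: int) -> None:
        self.lo = lo
        self.width = (hi - lo) / n_fine
        self.counts = [0] * n_fine

    def update(self, x: float) -> None:
        """Single-scan update: O(1) work per value, nothing else is stored."""
        i = int((x - self.lo) / self.width)
        i = min(max(i, 0), len(self.counts) - 1)  # clamp out-of-range values
        self.counts[i] += 1

    def cut_points(self, k: int) -> List[float]:
        """Build the final binning from the layer-1 summary alone."""
        target = sum(self.counts) / k
        cuts: List[float] = []
        seen = 0
        for i, c in enumerate(self.counts):
            seen += c
            if len(cuts) < k - 1 and seen >= target * (len(cuts) + 1):
                cuts.append(self.lo + (i + 1) * self.width)
        return cuts


encoded = binarize("udp", ("tcp", "udp", "icmp"))  # [0, 1, 0]

disc = TwoLayerDiscretizer(lo=0.0, hi=10.0, n_fine=10)
for v in (0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5):
    disc.update(v)
cuts = disc.cut_points(k=2)  # [5.0]
```

The second layer never touches the raw data, so it can be rebuilt at any time from the counts kept by the first layer.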
Paper Organization: Section 2 introduces our reference scenario, outlining some
requirements that a system working in streaming environments should satisfy. Section 3
describes our approach in detail, highlighting how the requirements introduced in
Section 2.1 are satisfied by the proposed model. Furthermore, it presents how our
adaptive selection is implemented. Section 4 presents a comparative study to understand
how the new adaptive approach guarantees a higher reliability of the system. In this
section, our approach is compared with other well-known approaches available in the
literature. Finally, Section 5 draws the conclusions and introduces some future work.
2 Data Streams Classification
Data streams represent a new challenge for the data mining community. In a stream
scenario, traditional mining methods are further constrained by the unpredictable
behavior of a large volume of data. The data arrive on-line at variable rates, and once
an element has been processed, it must be discarded or archived. In either case, it
cannot be easily retrieved. Mining systems have no control over data generation, and
they must be capable of guaranteeing a near real-time response.
Definition 1. A data stream is an infinite set of elements $X = X_1, \ldots, X_j, \ldots$
where each $X_i \in X$ has $a + 1$ dimensions, $(x_i^1, \ldots, x_i^a, y)$, and where
$y \in \{\bot, 1, \ldots, C\}$, and $1, \ldots, C$ identify the possible values in a class.
A stream can be divided into two sets based on the availability of the class label $y$.
If the value $y$ is available in the record ($y \neq \bot$), the record belongs to the
training set. Otherwise the record represents an element to classify, and the true label
will only be available after an unpredictable period of time.
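The split described above can be sketched as a one-pass routing step. This is an illustrative sketch, not the authors' implementation: the `Record` type and the `route` function are hypothetical names, attributes are assumed to be binary, and Python's `None` stands in for the missing label ⊥.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Tuple


@dataclass(frozen=True)
class Record:
    x: Tuple[int, ...]        # the a attribute values (binary in this sketch)
    y: Optional[int] = None   # class label in 1..C, or None standing in for ⊥


def route(stream: Iterable[Record]) -> Iterator[Tuple[str, Record]]:
    """Tag each element as it arrives; the stream is never held in memory."""
    for record in stream:
        yield ("train" if record.y is not None else "classify", record)


# A short finite prefix of a stream, for illustration only.
prefix = [Record((1, 0), 1), Record((0, 1)), Record((1, 1), 2)]
tags = [tag for tag, _ in route(prefix)]  # ["train", "classify", "train"]
```

Because `route` is a generator, each element is seen once and then released, which matches the constraint that processed elements cannot be easily retrieved.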
 