and the filter models from machine learning are frequently applied [2,3]. The wrapper
model assesses the selected features by a learning algorithm's performance. Therefore,
the wrapper method requires a lot of time and computational resources to find the best
feature subsets. The filter model considers the statistical characteristics of a dataset
directly, without involving any learning algorithm. Owing to its computational efficiency,
the filter method is usually used to select features from high-dimensional datasets, such as
those used in intrusion detection. Moreover, this method allows feature subsets to be evaluated
not only by their relevance, but also by the relationships between features that make
certain features redundant. A major challenge in the IDS feature selection process is
to choose appropriate measures that can precisely determine the relevance of, and the
relationships between, the features of a given dataset.
Since the relevance and the relationship are usually characterized in terms of correla-
tion or mutual information [2,3], we focus on the recently proposed generic feature
selection (GeFS) measure for intrusion detection [4]. This measure consists of two in-
stances that belong to the filter model from machine learning: the correlation feature
selection (CFS) measure and the minimal-redundancy-maximal-relevance (mRMR)
measure. In a given dataset, if many features are linearly correlated with each
other, then the CFS measure is recommended for selecting features. Otherwise, the
mRMR measure is chosen instead, as it considers non-linear relations through the
analysis of mutual information between the features. The GeFS measure was successfully
tested on the KDD CUP 1999 benchmark dataset for IDS [9]. However, this
dataset is out of date and has been heavily criticized by the IDS community (see, for
example, [7]). Moreover, the KDD CUP 1999 dataset does not contain enough HTTP
traffic for training and testing WAFs, and its Web attacks are not representative
of current Web attacks. Therefore, the question of how well the GeFS measure
performs in Web attack detection is still open.
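As a rough illustration of the two GeFS instances, the sketch below scores a candidate feature subset with the CFS merit and with the mRMR criterion, written in their commonly cited forms rather than as the exact formulation of [4]. It assumes a NumPy array `X` (rows are samples, columns are features) and a label vector `y`. The function names, the use of absolute Pearson correlation for CFS, and the restriction to discrete-valued, non-constant features are assumptions made for this example.

```python
import numpy as np


def cfs_merit(X, y, subset):
    """CFS merit of a feature subset: k*r_cf / sqrt(k + k*(k-1)*r_ff).

    r_cf is the mean |Pearson r| between each feature and the class,
    r_ff the mean |Pearson r| over distinct feature pairs.
    Assumes no column in `subset` is constant (corrcoef would give nan).
    """
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return float(r_cf)
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset)
                    for b in subset[i + 1:]])
    return float(k * r_cf / np.sqrt(k + k * (k - 1) * r_ff))


def mutual_information(a, b):
    """Mutual information (in nats) between two discrete 1-D arrays."""
    a_vals, a_idx = np.unique(a, return_inverse=True)
    b_vals, b_idx = np.unique(b, return_inverse=True)
    # Build the joint distribution from co-occurrence counts.
    joint = np.zeros((len(a_vals), len(b_vals)))
    np.add.at(joint, (a_idx, b_idx), 1)
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)   # marginal of a
    pb = joint.sum(axis=0, keepdims=True)   # marginal of b
    nz = joint > 0                          # skip empty cells (0 * log 0 = 0)
    return float(np.sum(joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])))


def mrmr_score(X, y, subset):
    """mRMR criterion: mean relevance minus mean redundancy.

    Redundancy is averaged over all ordered pairs, including a feature
    with itself, as in the original mRMR definition.
    """
    relevance = np.mean([mutual_information(X[:, j], y) for j in subset])
    redundancy = np.mean([mutual_information(X[:, a], X[:, b])
                          for a in subset for b in subset])
    return float(relevance - redundancy)
```

A feature selection procedure would search for the subset that maximizes one of these scores; the search strategy itself is outside the scope of this sketch.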
In this paper, we propose to use the GeFS measure for selecting features in Web at-
tack detection. We conducted experiments on the ECML/PKDD 2007 dataset, which
was generated for the ECML/PKDD 2007 Discovery Challenge [6]. However, the
attack requests of this dataset were constructed blindly [6] and did not target any real
Web application. Therefore, we additionally generated our new CSIC 2010 dataset,
which contains traffic directed to an e-commerce Web application. Drawing on our expert
knowledge of Web attacks, we listed 30 features that we considered relevant
for the detection process. Then, we extracted the values of these 30 relevant features
from the datasets. By applying the GeFS measure, we sought to determine, for each
dataset, which of the 30 extracted features are the most important
for the Web attack detection process. To this end, we first analyzed the statistical
properties of the datasets to see whether their features exhibited linear correlations
or non-linear relations. Specifically, the data points of the datasets were visualized
in two-dimensional space and the correlation coefficients were computed. We
then chose the CFS measure for feature selection from the CSIC 2010 dataset and the
mRMR measure for the ECML/PKDD 2007 dataset. The detection accuracies obtained
after feature selection were then evaluated with four different classifiers.
The experiments show that, by using appropriate instances of the GeFS measure, we
could remove 63% of the features from the original dataset as irrelevant or redundant,
while reducing the detection accuracy of WAFs by only 0.12%.
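The correlation analysis used to pick a GeFS instance could be sketched roughly as follows. The text does not state a numeric decision rule, so the `threshold` and `min_share` parameters, the function names, and the rule itself (prefer CFS when most feature pairs are strongly linearly correlated) are purely hypothetical. The sketch assumes a NumPy array `X` whose columns are non-constant features.

```python
import numpy as np


def linear_correlation_summary(X, threshold=0.5):
    """Summarize pairwise linear correlation between the columns of X.

    `threshold` is a hypothetical cut-off for a "strong" correlation,
    not a value taken from the paper.
    """
    corr = np.corrcoef(X, rowvar=False)       # features are columns
    upper = np.triu_indices_from(corr, k=1)   # each distinct pair once
    abs_r = np.abs(corr[upper])
    return {
        "mean_abs_r": float(abs_r.mean()),
        "share_strong": float((abs_r > threshold).mean()),
    }


def suggest_gefs_instance(X, threshold=0.5, min_share=0.5):
    """Hypothetical rule: CFS if enough pairs are strongly correlated."""
    summary = linear_correlation_summary(X, threshold)
    return "CFS" if summary["share_strong"] >= min_share else "mRMR"
```

In practice such a summary would complement, not replace, the visual inspection of two-dimensional scatter plots mentioned above, since a low Pearson coefficient alone does not reveal what kind of non-linear relation is present.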