Collective Classification for Spam Filtering - Computational Intelligence in Security for Information Systems


and the filter models from machine learning are frequently applied [2,3]. The wrapper model assesses the selected features by a learning algorithm's performance. Therefore, the wrapper method requires a lot of time and computational resources to find the best feature subsets. The filter model considers statistical characteristics of the dataset directly, without involving any learning algorithm. Owing to its computational efficiency, the filter method is usually used to select features from high-dimensional datasets, such as those in intrusion detection systems. Moreover, this method allows feature subsets to be evaluated not only by their relevance, but also by the relationships between features that make certain features redundant. A major challenge in the IDS feature selection process is to choose appropriate measures that can precisely determine the relevance of, and the relationships between, the features of a given dataset.
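To make the filter idea concrete, the following is a minimal sketch that scores features purely from dataset statistics, with no learning algorithm in the loop. It is an illustration under stated assumptions, not the method used in the paper: the scoring function (absolute Pearson correlation with the class label), the feature names `request_length` and `num_digits`, and the toy values are all hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)

def filter_rank(columns, labels):
    """Rank feature names by |correlation with the label|, highest first.

    No classifier is trained: the ranking depends only on statistics
    of the data, which is what distinguishes a filter from a wrapper.
    """
    scores = {name: abs(pearson(col, labels)) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy data: 1 = attack request, 0 = normal request.
columns = {
    "request_length": [10, 12, 50, 55, 11, 60],
    "num_digits": [3, 1, 2, 3, 2, 1],
}
labels = [0, 0, 1, 1, 0, 1]
ranking = filter_rank(columns, labels)
```

A wrapper would instead train and evaluate a classifier for each candidate subset, which is where the extra time and computational cost mentioned above comes from.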

Since the relevance and the relationship are usually characterized in terms of correlation or mutual information [2,3], we focus on the recently proposed generic feature selection (GeFS) measure for intrusion detection [4]. This measure consists of two instances that belong to the filter model from machine learning: the correlation feature selection (CFS) measure and the minimal-redundancy-maximal-relevance (mRMR) measure. If many features in a given dataset are linearly correlated with each other, then the CFS measure is recommended for selecting features. Otherwise, the mRMR measure is chosen instead, as it considers non-linear relations through the analysis of mutual information between the features. The GeFS measure was successfully tested on the KDD CUP 1999 benchmark dataset for IDS [9]. However, this dataset is out of date and has been heavily criticized by the IDS community (see, for example, [7]). Moreover, the KDD CUP 1999 dataset does not contain enough HTTP traffic for training and testing WAFs, and the Web attacks in this dataset are not representative of currently existing Web attacks. Therefore, the question of how the GeFS measure performs in Web attack detection is still open.
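As a rough illustration of the two instances, the sketch below evaluates a candidate feature subset with the commonly cited forms of both measures: the CFS merit, k·mean(r_cf) / sqrt(k + k(k−1)·mean(r_ff)), and the mRMR difference criterion, mean relevance I(f; c) minus mean pairwise redundancy I(f_i; f_j). These formulas are taken from the general feature selection literature and are assumed here; this excerpt does not spell them out, and the function names are hypothetical. The sketch only scores a given subset and does not search for the optimal one.

```python
import math
from collections import Counter

def cfs_merit(feature_class_corr, feature_feature_corr):
    """CFS merit of a subset of k features.

    feature_class_corr: |correlation| of each feature with the class.
    feature_feature_corr: |correlation| of each feature pair (may be empty).
    """
    k = len(feature_class_corr)
    mean_cf = sum(feature_class_corr) / k
    mean_ff = (sum(feature_feature_corr) / len(feature_feature_corr)
               if feature_feature_corr else 0.0)
    return k * mean_cf / math.sqrt(k + k * (k - 1) * mean_ff)

def mutual_information(xs, ys):
    """Mutual information in bits between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written with raw counts
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def mrmr_score(features, labels):
    """Mean relevance minus mean redundancy over all ordered feature pairs."""
    k = len(features)
    relevance = sum(mutual_information(f, labels) for f in features) / k
    redundancy = sum(mutual_information(fi, fj)
                     for fi in features for fj in features) / (k * k)
    return relevance - redundancy

# Two equally relevant features: a fully redundant pair gains nothing
# over a single feature, while an uncorrelated pair scores higher.
single = cfs_merit([0.6], [])
redundant_pair = cfs_merit([0.6, 0.6], [1.0])
independent_pair = cfs_merit([0.6, 0.6], [0.0])
```

In both measures the numerator or first term rewards relevance to the class, and the denominator or second term penalizes redundancy among the selected features; they differ in whether that dependence is measured by correlation or by mutual information.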

In this paper, we propose to use the GeFS measure for selecting features in Web attack detection. We conducted experiments on the ECML/PKDD 2007 dataset, which was generated for the ECML/PKDD 2007 Discovery Challenge [6]. However, the attack requests of this dataset were constructed blindly [6] and did not target any real Web application. Therefore, we additionally generated our new CSIC 2010 dataset, which contains traffic directed to an e-commerce Web application. Drawing on our expert knowledge about Web attacks, we listed 30 features that we considered relevant for the detection process. Then, we extracted the values of these 30 features from the datasets. By applying the GeFS measure, we wanted to find out which of the 30 extracted features are the most important for Web attack detection within each dataset. To do so, we analyzed the statistical properties of the datasets to see whether they had linear correlations or non-linear relations between features: the data points of the datasets were visualized in two-dimensional space and the correlation coefficients were computed. We then chose the CFS measure for feature selection from the CSIC 2010 dataset and the mRMR measure for the ECML/PKDD 2007 dataset. The detection accuracies obtained after feature selection were tested by means of four different classifiers. The experiments show that by using appropriate instances of the GeFS measure, we could remove 63% of irrelevant and redundant features from the original dataset, while reducing the detection accuracy of WAFs by only 0.12%.
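The choice between the two instances can be sketched as a simple check on pairwise correlation coefficients. The text does not state a numeric decision rule (the choice there also relied on visual inspection of the data), so the function name `suggest_measure`, the threshold of 0.5, and the "at least half of the pairs" criterion below are purely illustrative assumptions.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / math.sqrt(var_x * var_y)

def suggest_measure(columns, corr_threshold=0.5, min_fraction=0.5):
    """Suggest "CFS" if many feature pairs are linearly correlated, else "mRMR".

    corr_threshold and min_fraction are hypothetical tuning knobs,
    not values taken from the paper.
    """
    pairs = list(combinations(columns, 2))
    if not pairs:
        raise ValueError("need at least two feature columns")
    strong = sum(1 for a, b in pairs
                 if abs(pearson(columns[a], columns[b])) >= corr_threshold)
    fraction = strong / len(pairs)
    return ("CFS" if fraction >= min_fraction else "mRMR"), fraction

# Toy data: one set with linear relations, one with a purely quadratic relation.
linear = {"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 10], "c": [5, 4, 3, 2, 1]}
quadratic = {"x": [-2, -1, 0, 1, 2], "x_squared": [4, 1, 0, 1, 4]}
linear_choice, linear_fraction = suggest_measure(linear)
quadratic_choice, quadratic_fraction = suggest_measure(quadratic)
```

The quadratic example shows why a low correlation coefficient does not mean the features are unrelated: `x_squared` is fully determined by `x`, yet their Pearson correlation is zero, which is the situation where a mutual-information-based measure such as mRMR is the better fit.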
