Application of the Generic Feature Selection Measure in Detection of Web Attacks (Machine Learning and Intelligence)

Abstract

Feature selection for filtering HTTP-traffic in Web application firewalls (WAFs) is an important task. We focus on the Generic-Feature-Selection (GeFS) measure [4], which was successfully tested on low-level package filters, i.e., the KDD CUP’99 dataset. However, the performance of the GeFS measure in analyzing high-level HTTP-traffic is still unknown. In this paper we study the GeFS measure for WAFs. We conduct experiments on the publicly available ECML/PKDD-2007 dataset. Since this dataset does not target any real Web application, we additionally generate our new CSIC-2010 dataset. We analyze the statistical properties of both two datasets to provide more in-sides of their nature and quality. Subsequently, we determine appropriate instances of the GeFS measure for feature selection. We use different classifiers to test the detection accuracies. The experiments show that we can remove 63% of irrelevant and redundant features from the original dataset, while reducing only 0.12% the detection accuracy of WAFs.

Keywords: Web attack detection, Web application firewall, intrusion detection systems, feature selection, machine learning algorithms.

Introduction

Web attacks pose many serious threats to modern Internet. The number of Web attacks is steadily increasing, consequently Web application firewalls (WAFs) [8] need to be more and more effective. One of the approaches for improving the effectiveness of WAFs is to apply the feature selection methods. Achieving reduction of the number of relevant traffic features without negative effect on detection accuracy is a goal that greatly increases the available processing time of WAFs and reduces the required system resources. As there exist many feature selection algorithms (see, for example [2,3]), the question that arises is which ones could be applied in intrusion detection in general and in Web attack detection in particular. The most of the feature selection work in intrusion practice is still done manually and the quality of selected features depends strongly on expert knowledge. For automatic feature selection, the wrapper and the filter models from machine learning are frequently applied [2,3]. The wrapper model assesses the selected features by learning algorithm’s performance. Therefore, the wrapper method requires a lot of time and computational resources to find the best feature subsets. The filter model considers statistical characteristics of dataset directly without involving any learning algorithms. Due to the computational efficiency, the filter method is usually used to select features from high-dimensional datasets, such as intrusion detection systems. Moreover, this method allows to estimate feature subsets not only by their relevance, but also by the relationships between features that make certain features redundant. A major challenge in the IDS feature selection process is to choose appropriate measures that can precisely determine the relevance and the relationship between features of given dataset.

Since the relevance and the relationship are usually characterized in terms of correlation or mutual information [2,3], we focus on the recently proposed generic feature selection (GeFS) measure for intrusion detection [4]. This measure consists of two instances that belong to the filter model from machine learning: the correlation feature selection (CFS) measure and the minimal-redundancy-maximal-relevance (mRMR) measure. In given dataset, if there are many features that are linearly correlated to each other, then the CFS measure is recommended for selecting features. Otherwise, the mRMR measure is alternatively chosen as it considers non-linear relations through the analysis of mutual information between the features. The GeFS measure was successfully tested on the KDD CUP 1999 benchmarking dataset for IDS [9]. However, this dataset is out of date and it was heavily criticized by the IDS community (see, for example [7]). Moreover, the KDD CUP 1999 dataset does not contain enough HTTP traffic for training and testing WAFs and the Web attacks of this dataset are not representative for currently existing Web attacks. Therefore, the question about the performance of the GeFS measure perform in Web attack detection is still open.

In this paper, we propose to use the GeFS measure for selecting features in Web attack detection. We conducted experiments on the ECML/PKDD 2007 dataset, which was generated for the ECML/PKDD 2007 Discovery Challenge [6]. However, the attack requests of this dataset were constructed blindly [6] and did not target any real Web application. Therefore, we additionally generated our new CSIC 2010 dataset, which contains the traffic directed to an e-commerce Web application. From our expert knowledge about Web attacks, we listed 30 features that we considered relevant for the detection process. Then, we extracted the values of these 30 relevant features from the datasets. By applying the GeFS measure, we wanted to know within the particular datasets which features among the 30 extracted features are the most important for the Web attack detection process. In order to do that, we analyzed the statistical properties of the datasets to see whether they had linear correlation or non-linear relations between features. To do that, the data points of the datasets were visualized in the two-dimensional space and the correlation coefficients were computed. We then chose the CFS measure for feature selection from the CSIC 2010 dataset and the mRMR measure for the ECML/PKDD 2007 dataset. The detection accuracies obtained after the feature selection by means of four different classifiers were tested. The experiments show that by using appropriate instances of the GeFS measure, we could remove 63% of irrelevant and redundant features from the original dataset, while reducing only 0.12% the detection accuracy of WAFs.

The paper is organized as follows. Section 2 describes the generic feature selection (GeFS) measure for intrusion detection and its instances in more detail. Section 3 shows our experiments on the CSIC 2010 dataset and the ECML/PKDD 2007 dataset. The last section summarizes the findings.

Generic Feature Selection for Intrusion Detection

In this subsection, we give an overview of the generic feature selection (GeFS) measure together with two instances applied in intrusion detection: the correlation-feature-selection (CFS) measure and the minimal-redundancy-maximal-relevance (mRMR) measure [4].

Definition: The feature selection problem by means of the generic feature selection (GeFS) measure is to findthat maximizes the function GeFS(x):

In this definition, binary values of the variable xi indicate the appearance (xi = 1) or the absence (xi = 0) of the feature fi; a0, b0 are constants; Ai (x), Bi (x) are linear functions of variables x1,…,xn.

The Correlation Feature Selection (CFS) Measure: This measure characterizes the relevance of features and their relationships in terms of the linear correlation. For a given dataset, if there are many features that are linearly correlated to each other, the CFS measure is recommended for selecting features. In [4], it was shown that the CFS measure is an instance of the GeFS measure.

The minimal-Redundancy-Maximal-Relevance (mRMR) Measure: The mRMR measure considers non-linear relations through the analysis of mutual information between the features. Therefore, it was recommended for selecting features from datasets that have many non-linearly correlated features. It was also shown that the mRMR measure belongs to the generic feature selection (GeFS) measure [4].

The feature selection problem (1) can be solved by means of the optimization approach proposed in [4]. The main idea is that the problem (1) is transformed into mixed 0-linear programming problem, which can be solved by using the branch and bound algorithm.

The search strategy for obtaining subsets of relevant features by means of the GeFS measure is:

Step 1: Analyze the statistical properties of the given dataset in order to choose the appropriate feature selection instance (CFS or mRMR) from the generic feature selection measure GeFS. We choose the CFS measure if the dataset has many features that are linearly correlated to the class label and to each other. Otherwise, the mRMR measure is chosen.

Step 2: According to the choice from Step 1, construct the optimization problem (1) for the CFS measure or for the mRMR measure. In this step, we can use expert knowledge by assigning the value to the variable if the feature is relevant and the value 0 otherwise.

Step 3: Transform the optimization problem of the GeFS measure to a mixed 0-linear programming (M01LP) problem, which is to be solved by means of the branch-and-bound algorithm. A non-zero integer value of xi from the optimal solution x indicates the relevance of the feature fi regarding the GeFS measure.

Experiment

In this section, we show the application of the generic feature selection (GeFS) measure in Web attack detection. We first describe two datasets, on which the experiments were conducted: the ECML/PKDD 2007 dataset [6] and our new CSIC 2010 dataset. We then discuss the 30 features that we consider relevant for Web attack detection. We analyze the statistical properties of these datasets containing the 30 extracted features to choose appropriate instances from the GeFS measure. Since there is no standard Web application firewall (WAF), we apply four different machine learning algorithms to evaluate the detection accuracy on datasets containing the selected features.

Data Sets

We conducted experiments on the ECML/PKDD 2007 dataset, which was generated for the ECML/PKDD 2007 Discovery Challenge [6]. In fact, we used the training set, which is composed of 50,000 samples including 20% of attacks (i.e. 10,000 attacks and 40,000 normal requests). The requests are labeled with specifications of attack classes or normal traffic. The classes of attacks in this dataset are: Cross-Site Scripting, SQL Injection, LDAP Injection, XPATH Injection, Path traversal, Command Execution and SSI attacks. However, the attack requests of this dataset were constructed blindly and did not target any real Web application. Therefore, we additionally generated our new CSIC 2010 dataset for experiments.

The CSIC 2010 dataset contains the generated traffic targeted to an ecommerce Web application developed at our department. In this web application, users can buy items using shopping cart and register by providing some personal information. The dataset was generated automatically and contains 36,000 normal requests and more than 25,000 anomalous requests. In this data set the requests are labeled as normal or anomalous. We included attacks such as SQL injection, buffer overflow, information gathering, files disclosure, CRLF injection, XSS, server side include, parameter tampering and so on. In order to generate the traffic, we collected thousands of normal and anomalous values for the parameters of the web application. Then, we generated requests for every web-page and the values of the parameters, if any, were filled with the values collected (the normal values for the normal traffic and the anomalous ones for the anomalous traffic). Further details can be found in [5].

Table 1. Names of 30 features that are considered relevant for the detection of Web attacks. * refers to features selected by the CFS from the CSIC-2010 dataset; t refers to features selected by the mRMR from the CSIC 2010 dataset; refers to features • selected by the CFS from the ECML/PKDD 2007 dataset; and ◊ refers to features selected by the mRMR from the ECML/PKDD 2007 dataset.

Feature Name	Feature Name
Length of the request *	Length of the path *
Length of the arguments *	Length of the header "Accept"
Length of the header "Accept-Encoding"	Length of the header "Accept-Charset"
Length of the header "Accept-Language"	Length of the header "Cookie"
Length of the header "Content-Length"	Length of the header "Content-Type"
Length of the Host	Length of the header "Referer"
Length of the header "User-Agent"	Method identifier
Number of arguments *	Number of letters in the arguments *
Number of digits in the arguments *	Number of ‘special’ char in the arguments * •
Number of other char in the arguments •	Number of letters char in the path *
Number of digits in the path *	Number of ‘special’ char in the path *
Number of other char in path	Number of cookies
Minimum byte value in the request	Maximum byte value in the request *
Number of distinct bytes	Entropy
Number of keywords in the path	Number of keywords in the arguments

Experimental Settings

From our expert knowledge about Web attacks, we listed 30 features that we considered relevant for the detection process (see Table 1). Some features refer to the length of the request, the length of the path or the headers, as length is important for detecting buffer-overflow attacks. From our expert knowledge, we observed that the non-alphanumeric characters were present in many injection attacks. Therefore, we considered four types of characters: letters, digits, non-alphanumeric characters which have a special meaning in a set of programming languages (in Table 1 we refer to them as ‘special’ char) and other characters.

We analyzed their appearances in the path and in the argument’s values. We also studied the entropy of the bytes in the request. Additionally, we collected the keywords of several programming languages that were often used in the injection attacks and counted the number of their appearances in different parts of the request as a feature.

Then, we extracted the values of these 30 relevant features from the CSIC 2010 and from the ECML/PKDD 2007 datasets and analyzed their statistical properties to see whether they had linear or non-linear relations between features. From this analysis, the appropriate feature selection instance from the GeFS measure was chosen for each dataset according to the Step 1 of the search method described above. In order to do that, we first visualized the whole datasets in the two-dimensional space to get a plot matrix, in which each element was the distribution of data points depending on the values of feature and the class label or the values of two features. For instance, Fig.1 and Fig. 2 show the sample distributions of data points of the CSIC 2010 dataset and the ECML/PKDD 2007 dataset, respectively. We then calculated the correlation coefficients between the features. From these, we observed that the CSIC 2010 dataset had many features that were linearly correlated to each other, whereas in the ECML/PKDD 2007 dataset the non-linear relations between features were more representative. In fact, in the CSIC 2010 dataset, more than 63 % of the correlation coefficients were greater than 0.5, whereas in the ECML/PKDD 2007 dataset more than 83% of the correlation coefficients were less than 0.09. Therefore, we chose the CFS measure for selecting features from the CSIC 2010 dataset, and the mRMR measure for selecting features from the ECML/PKDD 2007 dataset. Moreover, the CFS and the mRMR measures were also applied to the ECML/PKDD 2007 and to the CSIC 2010 datasets, respectively, to see how the wrong choice of feature selection methods would negatively affect the detection performance.

Fig. 1. Sample distribution of data points of the CSIC 2010 dataset

Fig. 2. Sample distribution of data points of the ECML/PKDD 2007 dataset

We applied the optimization algorithm proposed in [4] to find globally optimal feature subsets by means of the CFS and the mRMR measures. Four classifiers with 10fold cross validation were used to evaluate detection performances before and after feature selection: C4.5, CART, RandomTree and RandomForest [1]. All the obtained results are given in the Tables 2, 3 and 4.

Experimental Results

Table 2 shows the number of full-set features before feature selection and the number of features selected by the CFS measure and the mRMR measure (Table 1 shows which features were selected). Table 3 and Table 4 summarize the detection rate as well as the false positive rate of four different classifiers performed on the CSIC 2010 dataset and the ECML/PKDD 2007 dataset, respectively.

Table 2. Full-set features and the number of selected features

Data Set	Full-set	CFS	mRMR
CSIC 2010	30	11	14
ECML/PKDD 2007	30	2	6

Table 3. Experimental results on the CSIC 2010 dataset

Classifiers	Detection rate			False Positive Rate
	Full-set	CFS	mRMR	Full-set	CFS	mRMR
C4.5	94.49	94.06	79.80	5.9	6.8	25.7
CART	94.12	93.71	79.85	6.2	6.8	25.3
Random Tree	92.30	92.70	71.36	8.3	7.8	30.6
Random Forest	93.71	93.68	71.70	7.2	7.2	30.5
Average	93.65	93.53	75.67	6.9	7.1	28

Table 4. Experimental results on the ECML/PKDD 2007 dataset

Classifiers	Detection rate			False Positive Rate
	Full-set	CFS	mRMR	Full-set	CFS	mRMR
C4.5	96.37	86.45	91.62	3.7	17.6	9.9
CART	96.11	86.45	91.54	4.3	17.6	10
Random Tree	96.89	86.39	93.41	2.6	17.7	6.4
Random Forest	98.80	86.39	95.18	1.2	17.7	5.0
Average	97.04	86.42	92.93	2.95	17.6	7.8

It can be observed from Table 2 and Table 3 that the CFS measure performed well on the CSIC 2010 dataset and gave better results than the mRMR measure. In fact, the CFS measure removed the number of irrelevant and redundant features from the data-set by more than 63%, while reducing very slightly (only 0.12%) the detection accuracy. In this case, the mRMR measure gave much worse results in comparison with the full-set features.

From Table 2 and Table 4, it can be seen that the mRMR measure removed 80% of irrelevant and redundant features from the ECML/PKDD 2007 dataset, whereas the detection accuracies were a bit lower than the ones obtained with the full-set feature. The CFS measure did not work well in this case.

Therefore, based on all these experiments we can say that the effectiveness of WAFs would be improved by choosing and using appropriate feature selection methods of the GeFS measure.

Conclusions

We have proposed to use the generic feature selection (GeFS) measure for Web attack detection. We analyzed statistical properties of the new generated CSIC 2010 dataset and the ECML/PKDD 2007 dataset. Based on this analysis, the CFS measure and the mRMR measure were chosen for selecting features from the CSIC 2010 dataset and the ECML/PKDD 2007 dataset, respectively. The detection accuracies obtained after the feature selection by means of four different classifiers were tested. The experiments show that by choosing appropriate instances of the GeFS measure, we could remove 63% of irrelevant and redundant features from the original dataset, while reducing only 0.12% the detection accuracy of WAFs.