are linearly correlated to the class label and to each other. Otherwise, the mRMR
measure is chosen.
- Step 2: According to the choice from Step 1, construct the optimization problem
(1) for the CFS measure or for the mRMR measure. In this step, we can use expert
knowledge by assigning the value 1 to the variable x_i if the feature f_i is relevant
and the value 0 otherwise.
- Step 3: Transform the optimization problem of the GeFS measure into a mixed
0-1 linear programming (M01LP) problem, which is then solved by means of the
branch-and-bound algorithm. A non-zero integer value of x_i in the optimal
solution x indicates the relevance of the feature f_i regarding the GeFS measure.
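To make the role of the binary variables x_i concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes the commonly used form of the CFS merit (sum of feature-class correlations divided by the square root of k plus twice the sum of pairwise feature-feature correlations, for k selected features), and it replaces the M01LP transformation and branch-and-bound with plain exhaustive enumeration, which is only feasible for a handful of features. The function names, the `fixed` argument used to encode expert knowledge, and all correlation values are hypothetical.

```python
from itertools import product
from math import sqrt


def cfs_merit(x, r_cf, r_ff):
    """CFS merit of the subset encoded by the 0/1 vector x (assumed formula).

    r_cf[i]    : correlation between feature i and the class label
    r_ff[i][j] : correlation between features i and j (symmetric matrix)
    """
    chosen = [i for i, xi in enumerate(x) if xi == 1]
    k = len(chosen)
    if k == 0:
        return 0.0
    sum_cf = sum(r_cf[i] for i in chosen)
    # Sum of feature-feature correlations over unordered pairs i < j.
    sum_ff = sum(r_ff[i][j] for a, i in enumerate(chosen) for j in chosen[a + 1:])
    return sum_cf / sqrt(k + 2.0 * sum_ff)


def best_subset(r_cf, r_ff, fixed=None):
    """Enumerate all 0/1 vectors and return the one with the highest merit.

    fixed maps a feature index to 0 or 1 and pins x_i to that value; this is
    a hypothetical stand-in for the expert knowledge mentioned in Step 2.
    """
    fixed = fixed or {}
    best_x, best_score = None, float("-inf")
    for x in product((0, 1), repeat=len(r_cf)):
        if any(x[i] != v for i, v in fixed.items()):
            continue  # candidate contradicts the pinned values
        score = cfs_merit(x, r_cf, r_ff)
        if score > best_score:
            best_x, best_score = x, score
    return best_x, best_score


# Made-up correlations: features 0 and 1 are both predictive but redundant.
R_CF = [0.8, 0.7, 0.6]
R_FF = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]

x_opt, merit = best_subset(R_CF, R_FF)
selected = [i for i, xi in enumerate(x_opt) if xi != 0]
print("optimal x:", x_opt, "selected features:", selected, "merit:", round(merit, 4))
```

In this sketch a non-zero x_i marks feature f_i as selected, mirroring the interpretation in Step 3. Calling `best_subset(R_CF, R_FF, fixed={1: 1})` forces feature 1 into the subset, which illustrates how pinned variables change the optimum.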
3 Experiment
In this section, we show the application of the generic feature selection (GeFS)
measure to Web attack detection. We first describe the two datasets on which the
experiments were conducted: the ECML/PKDD 2007 dataset [6] and our new CSIC 2010 dataset.
We then discuss the 30 features that we consider relevant for Web attack detection.
We analyze the statistical properties of these datasets containing the 30 extracted
features to choose appropriate instances of the GeFS measure. Since there is no
standard Web application firewall (WAF), we apply four different machine learning
algorithms to evaluate the detection accuracy on datasets containing the selected
features.
3.1 Data Sets
We conducted experiments on the ECML/PKDD 2007 dataset, which was generated
for the ECML/PKDD 2007 Discovery Challenge [6]. Specifically, we used the training set,
which is composed of 50,000 samples, 20% of which are attacks (i.e., 10,000 attacks
and 40,000 normal requests). The requests are labeled either as normal traffic or with
the class of the attack. The classes of attacks in this dataset are: Cross-Site Scripting,
SQL Injection, LDAP Injection, XPATH Injection, Path Traversal, Command
Execution and SSI attacks. However, the attack requests of this dataset were constructed
blindly and did not target any real Web application. Therefore, we additionally
generated our new CSIC 2010 dataset for experiments.
The CSIC 2010 dataset contains generated traffic targeted at an e-commerce
Web application developed at our department. In this Web application, users can buy
items using a shopping cart and register by providing some personal information. The
dataset was generated automatically and contains 36,000 normal requests and more
than 25,000 anomalous requests. In this dataset, the requests are labeled as normal or
anomalous. We included attacks such as SQL injection, buffer overflow, information
gathering, file disclosure, CRLF injection, XSS, server-side include and parameter
tampering, among others. To generate the traffic, we collected thousands of normal
and anomalous values for the parameters of the Web application. Then, we generated
requests for every Web page, and the values of the parameters, if any, were filled with
the values collected (the normal values for the normal traffic and the anomalous ones
for the anomalous traffic). Further details can be found in [5].
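The generation procedure described above can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the tool used to build the CSIC 2010 dataset: the page paths, parameter names and value pools are all hypothetical, and as a simplification every parameter of an anomalous request is drawn from the anomalous pool.

```python
import random
from urllib.parse import urlencode

# Hypothetical pages of the Web application and the parameters each accepts.
PAGES = {
    "/shop/register": ["login", "email"],
    "/shop/cart/add": ["item_id", "quantity"],
    "/shop/index": [],  # a page without parameters
}

# Hypothetical pools of collected values, one list per parameter.
NORMAL_VALUES = {
    "login": ["alice", "bob"],
    "email": ["alice@example.com", "bob@example.com"],
    "item_id": ["12", "57"],
    "quantity": ["1", "3"],
}
ANOMALOUS_VALUES = {
    "login": ["' OR '1'='1", "A" * 300],
    "email": ["<script>alert(1)</script>"],
    "item_id": ["../../etc/passwd", "12%0d%0aSet-Cookie: x=1"],
    "quantity": ["-1", "9999999999"],
}


def generate_requests(pages, values, label, rng):
    """Build one labeled request per page, filling parameters from `values`."""
    requests = []
    for path, params in pages.items():
        query = {p: rng.choice(values[p]) for p in params}
        # Pages without parameters are requested without a query string.
        url = path + ("?" + urlencode(query) if query else "")
        requests.append((label, url))
    return requests


rng = random.Random(0)  # fixed seed so the sketch is reproducible
traffic = (
    generate_requests(PAGES, NORMAL_VALUES, "normal", rng)
    + generate_requests(PAGES, ANOMALOUS_VALUES, "anomalous", rng)
)
for label, url in traffic:
    print(label, url)
```

Repeating the calls with different random draws would yield many requests per page; each request keeps its label so the output can serve as labeled training data.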