Collective Classification for Spam Filtering - Computational Intelligence in Security for Information Systems

are linearly correlated to the class label and to each other. Otherwise, the mRMR

measure is chosen.

- Step 2: According to the choice from Step 1, construct the optimization problem (1) for the CFS measure or for the mRMR measure. In this step, expert knowledge can be incorporated by assigning the value 1 to the variable if the feature is relevant and the value 0 otherwise.

- Step 3: Transform the optimization problem of the GeFS measure into a mixed 0-1 linear programming (M01LP) problem, which is solved by means of the branch-and-bound algorithm. A non-zero integer value of x_i in the optimal solution x indicates the relevance of the feature f_i with respect to the GeFS measure.
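To make the role of the binary variables x_i concrete, the following is a minimal sketch for the CFS instance only. It does not implement the M01LP transformation or the branch-and-bound algorithm described above; it simply enumerates all binary vectors, which is feasible only for a handful of features. The merit formula is the standard CFS merit k * r_cf / sqrt(k + k(k-1) * r_ff). The function names, the `fixed` argument (used here to model expert knowledge), and the correlation values are illustrative assumptions, not taken from the source.

```python
from itertools import product
from math import sqrt


def cfs_merit(x, r_cf, r_ff):
    """CFS merit of the feature subset encoded by the binary vector x.

    r_cf[i]    : correlation between feature i and the class label
    r_ff[i][j] : correlation between features i and j
    """
    idx = [i for i, xi in enumerate(x) if xi]
    k = len(idx)
    if k == 0:
        return 0.0
    mean_cf = sum(r_cf[i] for i in idx) / k
    if k == 1:
        mean_ff = 0.0
    else:
        pairs = [(i, j) for a, i in enumerate(idx) for j in idx[a + 1:]]
        mean_ff = sum(r_ff[i][j] for i, j in pairs) / len(pairs)
    return k * mean_cf / sqrt(k + k * (k - 1) * mean_ff)


def select_features(r_cf, r_ff, fixed=None):
    """Exhaustive search over x in {0,1}^n (a stand-in for branch-and-bound).

    fixed maps a feature index to 0 or 1 and pins that variable in advance,
    which is one hypothetical way to encode expert knowledge.
    """
    fixed = fixed or {}
    best_x, best_merit = None, -1.0
    for x in product((0, 1), repeat=len(r_cf)):
        if any(x[i] != v for i, v in fixed.items()):
            continue
        merit = cfs_merit(x, r_cf, r_ff)
        if merit > best_merit:
            best_x, best_merit = x, merit
    return best_x, best_merit


# Made-up correlations for three features; features 0 and 1 are redundant.
r_cf = [0.8, 0.7, 0.1]
r_ff = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]

free_x, free_merit = select_features(r_cf, r_ff)
pinned_x, pinned_merit = select_features(r_cf, r_ff, fixed={1: 1})
print(free_x, free_merit)
print(pinned_x, pinned_merit)
```

A value of 1 at position i of the returned vector marks feature f_i as selected. Exhaustive enumeration grows as 2^n, which is why the source resorts to an M01LP formulation for realistic feature counts.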

3 Experiment

In this section, we show the application of the generic feature selection (GeFS) measure in Web attack detection. We first describe the two datasets on which the experiments were conducted: the ECML/PKDD 2007 dataset [6] and our new CSIC 2010 dataset. We then discuss the 30 features that we consider relevant for Web attack detection. We analyze the statistical properties of these datasets, represented by the 30 extracted features, in order to choose appropriate instances of the GeFS measure. Since there is no standard Web application firewall (WAF), we apply four different machine learning algorithms to evaluate the detection accuracy on datasets containing the selected features.

3.1 Data Sets

We conducted experiments on the ECML/PKDD 2007 dataset, which was generated for the ECML/PKDD 2007 Discovery Challenge [6]. Specifically, we used the training set, which is composed of 50,000 samples, 20% of which are attacks (i.e. 10,000 attacks and 40,000 normal requests). The requests are labeled either as normal traffic or with the class of the attack. The attack classes in this dataset are: Cross-Site Scripting, SQL Injection, LDAP Injection, XPATH Injection, Path Traversal, Command Execution and SSI attacks. However, the attack requests of this dataset were constructed blindly and did not target any real Web application. Therefore, we additionally generated our new CSIC 2010 dataset for the experiments.

The CSIC 2010 dataset contains generated traffic targeted at an e-commerce Web application developed at our department. In this Web application, users can buy items using a shopping cart and register by providing some personal information. The dataset was generated automatically and contains 36,000 normal requests and more than 25,000 anomalous requests. In this dataset, the requests are labeled as normal or anomalous. We included attacks such as SQL injection, buffer overflow, information gathering, file disclosure, CRLF injection, XSS, server-side include, parameter tampering and so on. In order to generate the traffic, we collected thousands of normal and anomalous values for the parameters of the Web application. Then, we generated requests for every Web page, and the values of the parameters, if any, were filled with the collected values (the normal values for the normal traffic and the anomalous ones for the anomalous traffic). Further details can be found in [5].
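The generation procedure described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the page paths, parameter names, and value pools are hypothetical, and combining every collected value of every parameter (a Cartesian product) is an assumption, since the source does not say how values were paired. The actual generator is described in [5].

```python
from itertools import product
from urllib.parse import urlencode


def generate_requests(pages, values, label):
    """Build labeled request URLs for every page.

    pages  : maps a page path to the list of its parameter names
    values : maps a parameter name to the pool of collected values
    label  : "normal" or "anomalous", attached to every generated request
    """
    requests = []
    for path, params in pages.items():
        if not params:
            # Pages without parameters yield a single plain request.
            requests.append((label, path))
            continue
        pools = [values[p] for p in params]
        for combo in product(*pools):
            query = urlencode(dict(zip(params, combo)))
            requests.append((label, f"{path}?{query}"))
    return requests


# Hypothetical pages and value pools, for illustration only.
pages = {
    "/index.jsp": [],
    "/register.jsp": ["login", "email"],
}
normal_values = {
    "login": ["alice", "bob"],
    "email": ["a@example.com"],
}
anomalous_values = {
    "login": ["' OR '1'='1"],
    "email": ["a@example.com"],
}

normal_traffic = generate_requests(pages, normal_values, "normal")
anomalous_traffic = generate_requests(pages, anomalous_values, "anomalous")

for label, url in normal_traffic + anomalous_traffic:
    print(label, url)
```

Keeping the normal and anomalous value pools separate mirrors the source's description, in which normal values fill the normal traffic and anomalous values fill the anomalous traffic, so the label of each request follows directly from the pool it was drawn from.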
