Table 1. Names of the 30 features that are considered relevant for the detection of Web attacks.
* refers to features selected by the CFS from the CSIC 2010 dataset; † refers to features selected
by the mRMR from the CSIC 2010 dataset; • refers to features selected by the CFS from the
ECML/PKDD 2007 dataset; and ◊ refers to features selected by the mRMR from the
ECML/PKDD 2007 dataset.
Feature Name
Length of the request
Length of the path
Length of the arguments
Length of the header “Accept”
Length of the header “Accept-Encoding”
Length of the header “Accept-Charset”
Length of the header “Accept-Language”
Length of the header “Cookie”
Length of the header “Content-Length”
Length of the header “Content-Type”
Length of the Host
Length of the header “Referer”
Length of the header “User-Agent”
Method identifier
Number of arguments
Number of letters in the arguments
Number of digits in the arguments
Number of 'special' char in the arguments †•
Number of other char in the arguments
Number of letters in the path
Number of digits in the path
Number of 'special' char in the path
Number of other char in the path
Number of cookies
Minimum byte value in the request
Maximum byte value in the request
Number of distinct bytes
Entropy
Number of keywords in the path
Number of keywords in the arguments
3.2 Experimental Settings
Drawing on our expert knowledge of Web attacks, we listed 30 features that we considered
relevant for the detection process (see Table 1). Some features refer to the length
of the request, the path, or the headers, as length is important for detecting
buffer-overflow attacks. We also observed that non-alphanumeric characters were
present in many injection attacks. Therefore, we considered four types of characters:
letters, digits, non-alphanumeric characters that have a special meaning in a set of
programming languages (referred to in Table 1 as 'special' char), and other characters.
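The four character classes can be illustrated with a minimal Python sketch. The text does not list which characters count as 'special', so the set `SPECIAL_CHARS` below is purely an assumption chosen for illustration, and the function name `char_class_counts` is hypothetical.

```python
import string

# Assumption: the source does not enumerate its 'special' characters.
# This set is illustrative only (characters that are meaningful in
# SQL, HTML/JavaScript, or shell syntax).
SPECIAL_CHARS = set("'\"<>;()=&|%*/\\-#")


def char_class_counts(text: str) -> dict:
    """Count letters, digits, 'special' and other characters in text."""
    counts = {"letters": 0, "digits": 0, "special": 0, "other": 0}
    for ch in text:
        if ch in string.ascii_letters:
            counts["letters"] += 1
        elif ch in string.digits:
            counts["digits"] += 1
        elif ch in SPECIAL_CHARS:
            counts["special"] += 1
        else:
            # Everything else, e.g. whitespace or non-ASCII characters.
            counts["other"] += 1
    return counts
```

Applied separately to the path and to the argument values of a request, such a function would yield the eight character-count features of Table 1.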
We analyzed the appearances of these character types in the path and in the argument
values. We also studied the entropy of the bytes in the request. Additionally, we
collected keywords of several programming languages that were often used in injection
attacks and counted the number of their appearances in different parts of the request,
using these counts as features.
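As a rough sketch of the byte-level and keyword features, the Python code below computes the minimum and maximum byte value, the number of distinct bytes, and the entropy of a request, plus a keyword count. Several points are assumptions rather than details given in the text: entropy is taken to be Shannon entropy in bits over the byte frequencies, the keyword list `KEYWORDS` is a small illustrative sample, keywords are matched as whole words regardless of case, and the names `byte_stats` and `keyword_count` are hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: the source does not list its keywords; this is a small
# illustrative sample of SQL and JavaScript terms.
KEYWORDS = ("select", "union", "insert", "script", "alert", "exec")


def byte_stats(request: bytes) -> dict:
    """Return min/max byte value, distinct byte count and entropy."""
    if not request:
        return {"min_byte": None, "max_byte": None,
                "distinct_bytes": 0, "entropy": 0.0}
    counts = Counter(request)  # iterating over bytes yields integers 0..255
    total = len(request)
    # Assumption: Shannon entropy in bits over the byte distribution.
    entropy = -sum((n / total) * math.log2(n / total)
                   for n in counts.values())
    return {"min_byte": min(request), "max_byte": max(request),
            "distinct_bytes": len(counts), "entropy": entropy}


def keyword_count(text: str, keywords=KEYWORDS) -> int:
    """Count whole-word, case-insensitive keyword occurrences in text."""
    lowered = text.lower()
    return sum(len(re.findall(r"\b" + re.escape(k) + r"\b", lowered))
               for k in keywords)
```

Calling `keyword_count` once on the path and once on the argument string would give the two keyword features of Table 1, while `byte_stats` would be applied to the raw request.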
Then, we extracted the values of these 30 relevant features from the CSIC 2010
and the ECML/PKDD 2007 datasets and analyzed their statistical properties to
see whether the features had linear or non-linear relations. From this
analysis, the appropriate feature selection instance from the GeFS measure was chosen
for each dataset according to Step 1 of the search method described above. To
do that, we first visualized the whole datasets in two-dimensional space to
get a plot matrix, in which each element was the distribution of data points depending
on the values of a feature and the class label or the values of two features. For instance,
Fig. 1 and Fig. 2 show the sample distributions of data points of the CSIC 2010 dataset
and the ECML/PKDD 2007 dataset, respectively. We then calculated the correlation
 