cannot use any of the standard distance measurements. The distance between the instances of the presented model is taken from [10]. It is designed to calculate the distance between two sequences. We have selected this one (among all of those given in [10]) since it has proven to be the most efficient in terms of absolute execution time. The deployed distance function is in fact equivalent to the Manhattan distance after making the following assumption: a feature that does not exist in the first vector but exists in the second (and vice versa) is treated as existing with the value 0, since we can say that it occurs with frequency 0. In this way, we obtain two vectors of the same size, and the distance between the centre and an input lies between 0 (the vectors have the same features with the same feature values) and 2 (the vectors have completely different features, all with values greater than 0). In the same way, if the feature set of one vector is a subset of the feature set of the other, the distance lies between 0 and 1.
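For illustration, a minimal sketch of such a distance over sparse k-gram frequency maps is given below; it assumes the maps hold relative frequencies (each summing to 1), so the values fall into the ranges just described, and all names and example values are hypothetical rather than taken from [10].

```python
def k_gram_distance(a, b):
    """Manhattan-style distance between two sparse k-gram frequency maps.
    A k-gram missing from one map is treated as occurring with frequency 0.
    With relative frequencies, the result lies in [0, 2]; if one feature set
    is a subset of the other, it stays in [0, 1]."""
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in keys)


# Hypothetical frequency maps over 2-grams of recommendation values
centre = {("40", "60"): 0.5, ("60", "80"): 0.5}
normal = {("40", "60"): 0.4, ("60", "80"): 0.6}   # same feature set as the centre
attack = {("40", "0"): 0.5, ("0", "0"): 0.5}      # entirely new k-grams appear

print(k_gram_distance(centre, normal))   # 0.2 -- within [0, 1]
print(k_gram_distance(centre, attack))   # 2.0 -- disjoint feature sets
```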
In many situations the number of distinct k-grams, and therefore of possible features, can become huge. In that case, it is necessary to apply one of the following possibilities for reducing it. One possibility is to divide the range [0,100] into a few equidistant ranges (usually three to five) and assign a unique value or meaning to all the values that belong to one range; this significantly reduces the number of possible k-grams. Another possibility is to take the average of the values that belong to a certain range.
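A small sketch of the first possibility is given below, assuming the values lie in [0,100] and that the bin index itself serves as the unique value assigned to each range; the number of bins and the sample values are illustrative.

```python
def discretize(value, n_bins=5, low=0, high=100):
    """Map a value from [low, high] to one of n_bins equidistant ranges,
    so every value in a range is represented by a single label (the bin index)."""
    width = (high - low) / n_bins
    return min(int((value - low) // width), n_bins - 1)


# With 5 bins over [0, 100], the values 3, 47 and 98 collapse to bins 0, 2 and 4,
# which sharply reduces the number of distinct k-grams the sequences can produce.
print([discretize(v) for v in (3, 47, 98)])   # [0, 2, 4]
```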
Table 2. The characterization of the previous example (column headings: Features, Occurrence Frequency)
3.2 Detection and Isolation of Bad Mouthing Attack
As previously mentioned, we treat attacks as data outliers and deploy clustering techniques. In this work we use the self-organizing map (SOM) algorithm, as it is relatively fast and inexpensive even when the dimensionality of the data is huge, which can happen in our case.
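The sketch below shows one way such a map could be trained on the k-gram frequency vectors, using the Manhattan-style distance described above; the grid size, the linear decay of the learning rate and neighbourhood radius, and the random initialization are assumptions made here for illustration, not parameters reported in this work.

```python
import numpy as np

def train_som(data, rows=5, cols=5, iters=2000, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small self-organizing map on an (n_samples, n_features) array
    of k-gram frequency vectors; returns a (rows*cols, n_features) array of
    node weight vectors (the group centres)."""
    rng = np.random.default_rng(seed)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    weights = rng.random((rows * cols, data.shape[1]))
    for t in range(iters):
        x = data[rng.integers(len(data))]               # pick a random input
        lr = lr0 * (1 - t / iters)                      # decaying learning rate
        sigma = sigma0 * (1 - t / iters) + 1e-3         # decaying neighbourhood radius
        bmu = np.argmin(np.abs(weights - x).sum(axis=1))   # best-matching unit (Manhattan)
        d_grid = np.abs(grid - grid[bmu]).sum(axis=1)      # distance on the map lattice
        h = np.exp(-(d_grid ** 2) / (2 * sigma ** 2))      # neighbourhood function
        weights += lr * h[:, None] * (x - weights)
    return weights
```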
There are two possible approaches for detecting outliers using clustering techniques [11], depending on the following two possibilities: detecting outlying clusters, or detecting outlying data that belong to non-outlying clusters. In the first case, we calculate the average distance of each node to the rest of the nodes (or to its closest neighbourhood) (MD). In the latter case, we calculate the quantization error (QE) of each input as the distance from its group centre. If we train the SOM algorithm with clean data, it is obvious that we will have the second scenario. On the other hand, if traces of attacks existed during the training, both situations are possible. Thus, we can detect attacks whether or not traces of attacks were present during the training, which means that we do not impose any restrictions on the training data. This further means that we avoid the time-consuming and error-prone process of pre-processing the training data.
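The following sketch shows how the two indicators could be computed from the trained node weights and the inputs; the flagging rule (scores above the mean plus two standard deviations) is an illustrative convention, not one taken from this work.

```python
import numpy as np

def quantization_errors(weights, data):
    """QE of each input: Manhattan distance from the input to its best-matching node."""
    d = np.abs(data[:, None, :] - weights[None, :, :]).sum(axis=2)   # (n_inputs, n_nodes)
    return d.min(axis=1)

def mean_node_distances(weights):
    """MD of each node: average Manhattan distance to all the other nodes of the map."""
    d = np.abs(weights[:, None, :] - weights[None, :, :]).sum(axis=2)
    return d.sum(axis=1) / (len(weights) - 1)

def flag_outliers(scores, n_std=2.0):
    """Flag inputs (by QE) or nodes (by MD) whose score is unusually large."""
    return scores > scores.mean() + n_std * scores.std()
```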
In the first step we examine the recommendations for each node in order to find inconsistencies. Bearing in mind that the attacks will often result in the creation of new k-grams, it is reasonable to assume that a vector extracted in the presence of attackers will not be a subset of any vector extracted in the normal situation, and thus the distance will be greater than 1.
 