cannot use any of the standard distance measurements. The distance between the instances of the presented model is taken from [10]. It is designed to calculate the distance between two sequences. We have selected this one (among all of those given in [10]) since it has proven to be the most efficient in terms of absolute execution time. The deployed distance function is in fact equivalent to the Manhattan distance after making the following assumption: a feature that does not exist in the first vector but exists in the second (and vice versa) is treated as existing with the value 0, since we can say that it occurs with frequency 0. In this way, we obtain two vectors of the same size, and the distance between the centre and an input lies between 0 (the vectors have the same features with the same feature values) and 2 (the vectors have completely different features, all with values greater than 0). In the same way, if the feature set of one vector is a subset of the feature set of the other, the distance lies between 0 and 1.
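For illustration, a minimal sketch of such a distance over sparse k-gram frequency maps is given below; it assumes the maps hold relative frequencies (each summing to 1), so the values fall into the ranges just described, and all names and example values are hypothetical rather than taken from [10].

```python
def k_gram_distance(a, b):
    """Manhattan-style distance between two sparse k-gram frequency maps.
    A k-gram missing from one map is treated as occurring with frequency 0.
    With relative frequencies, the result lies in [0, 2]; if one feature set
    is a subset of the other, it stays in [0, 1]."""
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in keys)


# Hypothetical frequency maps over 2-grams of recommendation values
centre = {("40", "60"): 0.5, ("60", "80"): 0.5}
normal = {("40", "60"): 0.4, ("60", "80"): 0.6}   # same feature set as the centre
attack = {("40", "0"): 0.5, ("0", "0"): 0.5}      # entirely new k-grams appear

print(k_gram_distance(centre, normal))   # 0.2 -- within [0, 1]
print(k_gram_distance(centre, attack))   # 2.0 -- disjoint feature sets
```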
In many situations the number of distinct k-grams, and therefore of possible features, can become huge. In that case, it is necessary to apply one of the following possibilities for reducing it. One possibility is to divide the range [0,100] into a few equidistant ranges (usually three to five) and assign a unique value or meaning to all the values that belong to one range; this significantly reduces the number of possible k-grams. Another possibility is to take the average of the values that belong to a certain range.
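A small sketch of the first possibility is given below, assuming the values lie in [0,100] and that the bin index itself serves as the unique value assigned to each range; the number of bins and the sample values are illustrative.

```python
def discretize(value, n_bins=5, low=0, high=100):
    """Map a value from [low, high] to one of n_bins equidistant ranges,
    so every value in a range is represented by a single label (the bin index)."""
    width = (high - low) / n_bins
    return min(int((value - low) // width), n_bins - 1)


# With 5 bins over [0, 100], the values 3, 47 and 98 collapse to bins 0, 2 and 4,
# which sharply reduces the number of distinct k-grams the sequences can produce.
print([discretize(v) for v in (3, 47, 98)])   # [0, 2, 4]
```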
Table 2. The characterization of the previous example (column headings: Features, Occurrence Frequency)
3.2 Detection and Isolation of Bad Mouthing Attack
As previously mentioned, we treat attacks as data outliers and deploy clustering techniques. In this work we use the self-organizing map (SOM) algorithm, as it is relatively fast and inexpensive even when the dimensionality of the data is huge, which can happen in our case.
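The sketch below shows one way such a map could be trained on the k-gram frequency vectors, using the Manhattan-style distance described above; the grid size, the linear decay of the learning rate and neighbourhood radius, and the random initialization are assumptions made here for illustration, not parameters reported in this work.

```python
import numpy as np

def train_som(data, rows=5, cols=5, iters=2000, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small self-organizing map on an (n_samples, n_features) array
    of k-gram frequency vectors; returns a (rows*cols, n_features) array of
    node weight vectors (the group centres)."""
    rng = np.random.default_rng(seed)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    weights = rng.random((rows * cols, data.shape[1]))
    for t in range(iters):
        x = data[rng.integers(len(data))]               # pick a random input
        lr = lr0 * (1 - t / iters)                      # decaying learning rate
        sigma = sigma0 * (1 - t / iters) + 1e-3         # decaying neighbourhood radius
        bmu = np.argmin(np.abs(weights - x).sum(axis=1))   # best-matching unit (Manhattan)
        d_grid = np.abs(grid - grid[bmu]).sum(axis=1)      # distance on the map lattice
        h = np.exp(-(d_grid ** 2) / (2 * sigma ** 2))      # neighbourhood function
        weights += lr * h[:, None] * (x - weights)
    return weights
```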
There are two possible approaches for detecting outliers using clustering techniques [11], depending on the following two possibilities: detecting outlying clusters, or detecting outlying data that belong to non-outlying clusters. In the first case, we calculate the average distance of each node to the rest of the nodes (or to its closest neighbourhood) (MD). In the latter case, we calculate the quantization error (QE) of each input as the distance from its group centre. If we train the SOM algorithm with clean data, it is obvious that we will have the second scenario. On the other hand, if traces of attacks existed during the training, both situations are possible. Thus, we can detect attacks whether or not traces of attacks were present during the training, which means that we do not impose any restrictions on the training data. This further means that we avoid the time-consuming and error-prone process of pre-processing the training data.
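The following sketch shows how the two indicators could be computed from the trained node weights and the inputs; the flagging rule (scores above the mean plus two standard deviations) is an illustrative convention, not one taken from this work.

```python
import numpy as np

def quantization_errors(weights, data):
    """QE of each input: Manhattan distance from the input to its best-matching node."""
    d = np.abs(data[:, None, :] - weights[None, :, :]).sum(axis=2)   # (n_inputs, n_nodes)
    return d.min(axis=1)

def mean_node_distances(weights):
    """MD of each node: average Manhattan distance to all the other nodes of the map."""
    d = np.abs(weights[:, None, :] - weights[None, :, :]).sum(axis=2)
    return d.sum(axis=1) / (len(weights) - 1)

def flag_outliers(scores, n_std=2.0):
    """Flag inputs (by QE) or nodes (by MD) whose score is unusually large."""
    return scores > scores.mean() + n_std * scores.std()
```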
In the first step we examine the recommendations for each node in order to find inconsistencies. Bearing in mind that the attacks will often result in the creation of new k-grams, it is reasonable to assume that a vector extracted in the presence of attackers will not be a subset of any vector extracted in the normal situation, and thus the distance will be greater than 1.
 