The well-known naïve Bayes classifier was chosen as the filter for comparison due
to its wide application in the context of spam filtering [1,4,6].
Tables 1 and 2 present the performance indexes for naïve Bayes and SRABNET.
The values are averages over the 30 runs, and the symbol (±) denotes the
standard deviation. Because the entire dataset is shuffled at the beginning of the
algorithm to set up ten-fold cross-validation, nothing prevents at least one fold
from containing only legitimate messages or only spam. When this happened, we did
not use the resulting values to compute the average. It is important to stress that
this 'peculiarity' occurs only with the naïve Bayes filter, mainly because it uses
only the samples from the training set to calculate the probability that a sample is
spam. Thus, if the training set contains only legitimate messages, the probability
attributed to a message being spam is strongly affected, and all messages are
classified as legitimate.
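The degenerate case described above can be illustrated with a minimal sketch. The naïve Bayes variant used in the experiments is not specified in this section, so the binary features, the Laplace smoothing, and all function names below are assumptions made for illustration only.

```python
import math

def train_naive_bayes(samples, labels, classes=("legit", "spam")):
    """Estimate class priors and Laplace-smoothed Bernoulli likelihoods.

    Assumption: binary feature vectors; not necessarily the paper's variant.
    """
    n_features = len(samples[0])
    model = {}
    for c in classes:
        rows = [x for x, y in zip(samples, labels) if y == c]
        # The prior is estimated from the training set alone,
        # so it is exactly zero when class c never appears.
        prior = len(rows) / len(samples)
        likes = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                 for j in range(n_features)]
        model[c] = (prior, likes)
    return model

def predict(model, x):
    """Return the class with the highest log-posterior score."""
    best, best_score = None, -math.inf
    for c, (prior, likes) in model.items():
        if prior == 0.0:
            continue  # unseen class: its posterior is zero for every message
        score = math.log(prior) + sum(
            math.log(p if v else 1.0 - p) for p, v in zip(likes, x))
        if score > best_score:
            best, best_score = c, score
    return best

def is_degenerate(labels):
    """True when a label set contains a single class only."""
    return len(set(labels)) < 2

# Hypothetical training set with legitimate messages only.
train_x = [[1, 0], [0, 1], [1, 1]]
train_y = ["legit", "legit", "legit"]
model = train_naive_bayes(train_x, train_y)
print(predict(model, [1, 1]))  # -> legit, whatever the message looks like
```

A check such as `is_degenerate` is one way to detect the runs whose values would be left out of the average.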
Table 1. Performance Measures with λ = 9

Filter        Spam Recall (%)   Spam Precision (%)   WAcc (%)        TCR
naïve Bayes   14.17             73.05                34.45 ± 1.71    1.08 ± 0.4
SRABNET       85.90             97.37                97.18 ± 0.14    2.85 ± 0.02
Table 2. Performance Measures with λ = 999

Filter        Spam Recall (%)   Spam Precision (%)   WAcc (%)        TCR
naïve Bayes   14.38             72.16                35.61 ± 1.77    0.05 ± 0.06
SRABNET       60.21             97.73                98.38 ± 0.09    0.07 ± 0.001
For λ = 999, both filters score TCR < 1. This is probably due to the very high
weight given to false positives (L → S, legitimate messages classified as spam),
which neither filter manages to eliminate completely. That is, higher values of λ
favor the baseline (no filter at all), since the baseline produces no false
positives. Despite these results, SRABNET remains the better filter with respect
to WAcc and even TCR.
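To make the effect of λ concrete, the sketch below computes WAcc and TCR from confusion counts. It assumes the usual cost-sensitive definitions of these measures, which are not restated in this section; the function names and the counts in the example are hypothetical and are not taken from the experiments.

```python
import math

def weighted_accuracy(n_ll, n_ls, n_sl, n_ss, lam):
    """WAcc: each legitimate message counts as lam messages.

    n_ll: legitimate kept, n_ls: legitimate marked as spam (false positives),
    n_sl: spam let through, n_ss: spam caught.
    """
    n_legit = n_ll + n_ls
    n_spam = n_sl + n_ss
    return (lam * n_ll + n_ss) / (lam * n_legit + n_spam)

def total_cost_ratio(n_ls, n_sl, n_ss, lam):
    """TCR: cost of using no filter divided by the cost of using the filter."""
    n_spam = n_sl + n_ss          # with no filter, every spam gets through
    cost = lam * n_ls + n_sl      # false positives are lam times more costly
    return math.inf if cost == 0 else n_spam / cost

# Hypothetical counts: 1000 legitimate messages, 200 spam messages.
counts = dict(n_ls=10, n_sl=40, n_ss=160)
print(total_cost_ratio(lam=9, **counts))    # above 1: better than no filter
print(total_cost_ratio(lam=999, **counts))  # below 1: worse than no filter

# The baseline (no filter) has no false positives and misses all spam,
# so its TCR equals 1 for any value of lam.
print(total_cost_ratio(n_ls=0, n_sl=200, n_ss=0, lam=999))
```

With the same ten false positives, the filter in this example is worthwhile at λ = 9 but falls below the baseline at λ = 999, which is the behavior discussed above.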
For λ = 9, both filters reach TCR > 1, with the antibody network clearly
outperforming the naïve Bayes filter. This is mainly due to the fact that the immune
algorithm makes no assumption about the independence of the attributes,
allowing a better positioning of the prototypes (antibodies) in the feature space.
The poor performance of naïve Bayes for all values of λ can be attributed
to the method applied here to reduce dimensionality. The concise information
that remains in the feature vector may deceive the Bayesian classifier.
6.2 Future Work
Further important analysis includes corpora in which the data (messages) have a
temporal sequence. Some experiments with artificial datasets have already been