An Immunological Filter for Spam - Artificial Immune Systems - page 455

The well-known naïve Bayes classifier was chosen as the filter for comparison due to its wide application in the context of spam filtering [1,4,6].

Tables 1 and 2 present the performance indexes for naïve Bayes and SRABNET. The values are averages over the 30 runs, and the symbol (±) denotes the standard deviation. Because the entire dataset is shuffled at the beginning of the algorithm to set up ten-fold cross-validation, nothing prevents at least one fold from containing only legitimate or only spam messages. When this happened, we did not use the corresponding values to compute the average. It is important to stress that this 'peculiarity' occurs only with the naïve Bayes filter, mainly because it uses only the samples from the training set to calculate the probability of a sample being spam or not. Thus, if the training set has only legitimate messages, the value attributed to the probability of a message being spam is strongly affected, and in this scenario all messages will be classified as legitimate.
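The fold check described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names (`ten_fold_indices`, `is_single_class`, `usable_folds`) and the string labels are hypothetical, and the sketch simply assumes that a fold holding a single class is left out of the averages.

```python
import random

def ten_fold_indices(n_samples, n_folds=10, seed=0):
    """Shuffle the sample indices and split them into n_folds nearly equal folds."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Striding over the shuffled list gives folds whose sizes differ by at most one.
    return [indices[i::n_folds] for i in range(n_folds)]

def is_single_class(labels, fold):
    """True when every message in the fold carries the same label."""
    return len({labels[i] for i in fold}) < 2

def usable_folds(labels, folds):
    """Keep only the folds that contain both legitimate and spam messages."""
    return [fold for fold in folds if not is_single_class(labels, fold)]

# Hypothetical toy corpus: mostly legitimate messages, few spam messages.
labels = ["legit"] * 45 + ["spam"] * 5
folds = ten_fold_indices(len(labels))
kept = usable_folds(labels, folds)
print(f"{len(kept)} of {len(folds)} folds contain both classes")
```

With a corpus this unbalanced, a plain shuffle can easily produce single-class folds; a stratified split would avoid them, but that is a different procedure from the one described in the text.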

Table 1. Performance Measures with λ = 9

Filter        Spam Recall (%)   Spam Precision (%)   WAcc (%)        TCR
naïve Bayes   14.17             73.05                34.45 ± 1.71    1.08 ± 0.4
SRABNET       85.90             97.37                97.18 ± 0.14    2.85 ± 0.02

Table 2. Performance Measures with λ = 999

Filter        Spam Recall (%)   Spam Precision (%)   WAcc (%)        TCR
naïve Bayes   14.38             72.16                35.61 ± 1.77    0.05 ± 0.06
SRABNET       60.21             97.73                98.38 ± 0.09    0.07 ± 0.001

For λ = 999, both filters score TCR < 1. This is probably due to the very high weight given to false positives (L → S): neither filter manages to eliminate these errors completely. In other words, higher values of λ favor the baseline (no filter at all), since it produces no false positives. Despite these results, SRABNET remains the better filter when WAcc and even TCR are taken into consideration.
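To make the role of λ concrete, here is a small sketch of the cost-sensitive measures. It assumes the usual definitions from the spam-filtering literature, in which each legitimate message counts λ times: WAcc = (λ·n_L→L + n_S→S) / (λ·N_L + N_S) and TCR = N_S / (λ·n_L→S + n_S→L). These formulas are not reproduced on this page, and the counts below are invented for illustration only; they are not the paper's results.

```python
def weighted_accuracy(n_ll, n_ls, n_ss, n_sl, lam):
    """Weighted accuracy: each legitimate message is counted lam times.

    n_ll: legitimate kept as legitimate   n_ls: legitimate flagged as spam
    n_ss: spam flagged as spam            n_sl: spam let through as legitimate
    """
    n_legit = n_ll + n_ls
    n_spam = n_ss + n_sl
    return (lam * n_ll + n_ss) / (lam * n_legit + n_spam)

def total_cost_ratio(n_ls, n_ss, n_sl, lam):
    """Cost of using no filter divided by the cost of using the filter.

    Without a filter every spam message gets through (cost N_S) and no
    legitimate message is ever blocked, so TCR > 1 means the filter helps.
    """
    n_spam = n_ss + n_sl
    cost = lam * n_ls + n_sl
    return float("inf") if cost == 0 else n_spam / cost

# Hypothetical counts: 1000 legitimate and 200 spam messages,
# with 2 false positives and 40 missed spam messages.
for lam in (9, 999):
    wacc = weighted_accuracy(n_ll=998, n_ls=2, n_ss=160, n_sl=40, lam=lam)
    tcr = total_cost_ratio(n_ls=2, n_ss=160, n_sl=40, lam=lam)
    print(f"lambda={lam}: WAcc={wacc:.4f}, TCR={tcr:.3f}")
```

With these made-up counts the same two false positives leave TCR above 1 at λ = 9 but push it below 1 at λ = 999, which is the effect discussed above: the larger λ is, the harder it becomes for any filter with a nonzero false-positive count to beat the no-filter baseline.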

For λ = 9, both filters reach a TCR > 1, with the antibody network clearly outperforming the naïve Bayes filter. This is mainly due to the fact that the immune algorithm makes no assumption about the independence of the attributes, allowing a better positioning of the prototypes (antibodies) in the feature space.

The poor performance of naïve Bayes for all values of λ can be attributed to the method applied here to reduce dimensionality: the concise information that remains in the feature vector may mislead the Bayesian classifier.

6.2 Future Work

Further important analyses include corpora in which the data (messages) have a temporal sequence. Some experiments with artificial datasets have already been
