To calculate the class distribution of a complete set or a subset, the instance
weights from the training set are summed and then normalised. For a nominal
attribute, the distribution corresponds to the normalised sum of weights for each
distinct value; for a numeric attribute, a binary split based on the median is
computed, the weights are summed for the two resulting bins, and the totals are
finally normalised.
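The weighted-distribution computation described above can be sketched as follows. This is a minimal illustration, not WEKA's actual implementation; instances are assumed to be (label, weight) or (value, weight) pairs.

```python
from statistics import median

def class_distribution(instances):
    """Normalised class distribution of a (sub)set of weighted instances.

    Each instance is a (class_label, weight) pair; weights are summed per
    class and the totals are normalised so they sum to 1.
    """
    totals = {}
    for label, weight in instances:
        totals[label] = totals.get(label, 0.0) + weight
    s = sum(totals.values())
    return {label: w / s for label, w in totals.items()}

def numeric_split_distribution(values_weights):
    """Median-based binary split distribution for a numeric attribute.

    values_weights is a list of (value, weight) pairs; weights are summed
    for the two bins (<= median, > median) and then normalised.
    """
    med = median(v for v, _ in values_weights)
    bins = [0.0, 0.0]
    for v, w in values_weights:
        bins[0 if v <= med else 1] += w
    s = sum(bins)
    return [b / s for b in bins]
```

The same scheme applies to a complete set or any subset: only the list of instances passed in changes.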
2.4 RandomWoods
RandomWoods works like WEKA's classic RandomForest, but whereas Random-
Forest combines Bagging with RandomTree, RandomWoods combines Collective-
Bagging with CollectiveTree. CollectiveBagging is classic Bagging, a machine-
learning ensemble meta-algorithm that improves stability and classification
accuracy, extended so that it can be applied to collective classifiers.
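The bagging scheme underlying both RandomForest and RandomWoods can be sketched generically. This is a hedged illustration of the meta-algorithm only, not WEKA's CollectiveBagging code; `build_tree` stands in for whatever base learner (RandomTree, CollectiveTree) the ensemble wraps.

```python
import random

def bagging_train(train_set, build_tree, n_trees=10, seed=42):
    """Generic bagging: train n_trees base models, each on a bootstrap
    sample drawn with replacement, the same size as the training set."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(train_set) for _ in train_set]
        forest.append(build_tree(sample))
    return forest

def bagging_predict(forest, instance):
    """Classify by majority vote over the ensemble's predictions."""
    votes = {}
    for tree in forest:
        label = tree(instance)
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)
```

Swapping the base learner and the sampling wrapper is exactly the difference the text describes: RandomForest = Bagging + RandomTree, RandomWoods = CollectiveBagging + CollectiveTree.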
3 Empirical Evaluation
To evaluate the collective algorithms we used the Ling Spam 3 and SpamAssas-
sin 4 datasets. Ling Spam consists of a mixture of spam and legitimate
messages retrieved from the Linguistic list, an e-mail distribution list about
linguistics. It comprises 2,893 different e-mails, of which 2,412 are legitimate
e-mails obtained by downloading digests from the list and 481 are spam e-mails
retrieved from one of the authors of the corpus (for a more detailed description
of the corpus please refer to [14,15]). Of the 4 datasets provided in this corpus,
each with different pre-processing steps, we chose the Bare dataset, which has
no pre-processing.
The SpamAssassin public mail corpus is a selection of 1,897 spam messages
and 4,150 legitimate e-mails. Unfortunately, due to computational restrictions
we were obliged to reduce the dataset by 50%, so the final dataset comprises
3,023 e-mails, of which 964 are spam e-mails and 2,059 are legitimate messages.
In addition, for both datasets we performed Stop Word Removal [16] based
on an external stop-word list 5 and removed any non-alphanumeric characters.
We then used the Vector Space Model (VSM) [17], an algebraic approach for
Information Filtering (IF), Information Retrieval (IR), indexing and ranking,
to create the model. This model represents natural-language documents
mathematically as vectors in a multidimensional space.
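The preprocessing and VSM steps just described can be sketched as follows. This is a minimal, hedged illustration: the tiny `STOP_WORDS` set stands in for the external stop-word list cited above, and the vectors are plain term-frequency counts over a fixed vocabulary.

```python
import re
from collections import Counter

# Stand-in for the external stop-word list referenced in the text.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}

def preprocess(text):
    """Lowercase, keep only alphanumeric tokens, and remove stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def to_vector(tokens, vocabulary):
    """Vector Space Model: represent a document as a term-frequency
    vector with one dimension per vocabulary term."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]
```

Each e-mail thus becomes a point in a space whose dimensions are the retained attributes, which is the representation the attribute-selection step below operates on.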
We extracted the top 1,000 attributes using Information Gain [18], an algo-
rithm that evaluates the relevance of an attribute by measuring the information
gain with respect to the class: IG(j) = Σ_{v_j ∈ R} Σ_{C_i} P(v_j, C_i) ·
log( P(v_j, C_i) / (P(v_j) · P(C_i)) ), where C_i is the i-th class, v_j is the
value of the j-th attribute, and P(v_j, C_i) is the probability that the j-th
attribute has the value v_j in the class C_i.
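The Information Gain score above can be estimated from joint counts of attribute values and class labels. A minimal sketch, assuming binary (presence/absence) attribute values and maximum-likelihood probability estimates, not the actual WEKA attribute evaluator:

```python
from math import log2

def information_gain(pairs):
    """IG(j) = sum_v sum_c P(v, c) * log2(P(v, c) / (P(v) * P(c))),
    estimated from a list of (attribute_value, class_label) observations."""
    n = len(pairs)
    joint, val_counts, cls_counts = {}, {}, {}
    for v, c in pairs:
        joint[(v, c)] = joint.get((v, c), 0) + 1
        val_counts[v] = val_counts.get(v, 0) + 1
        cls_counts[c] = cls_counts.get(c, 0) + 1
    ig = 0.0
    for (v, c), count in joint.items():
        p_vc = count / n
        # p_vc / (P(v) * P(c)) with P(v) = val_counts[v]/n, P(c) = cls_counts[c]/n
        ig += p_vc * log2(count * n / (val_counts[v] * cls_counts[c]))
    return ig
```

Scoring each of the candidate attributes this way and keeping the 1,000 highest-scoring ones corresponds to the selection step described in the text.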
3 http://nlp.cs.aueb.gr/software_and_datasets/lingspam_public.tar.gz
4 http://spamassassin.apache.org/publiccorpus/
5 http://www.webconfs.com/stop-words.php