3 The Corpus Used
The corpus used to validate the proposal is the PU1 corpus [4]², which consists of 1,099 messages with a spam rate of 43.77%, divided as follows:
- 481 spam messages: all the spam messages received over a period of 22 months, excluding non-English messages and duplicates of spam messages sent on the same day.
- 618 legitimate messages, all in English, received over a period of 36 months.
All messages have header fields and HTML tags removed, leaving only the subject line and body text, resulting in a total vocabulary of 24,748 words. Each token was mapped to a unique integer to ensure the privacy of the content. There are four versions of this dataset: with or without stemming, and with or without stop-word removal. Stop-word removal is a procedure that removes the most frequently used words, such as 'and', 'for', and 'a', while stemming is the process of reducing a word to its root form (e.g., 'learner' becomes 'learn'). These methods are used mainly to reduce the dimensionality of the feature space, aiming to improve the classifier's prediction. However, Androutsopoulos et al. [4] demonstrated that stop-word removal and stemming may not yield a statistically significant improvement. For this reason, in the experiments to be presented we have adopted the version without stemming or stop-word removal, although we have applied a simple dimensionality-reduction procedure to alleviate the data sparseness problem.
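To make the two procedures concrete, the following is a minimal sketch of stop-word removal and stemming. The stop-word list and the suffix-stripping rules are toy illustrations chosen for this example; they are not the actual preprocessing used to build PU1 (a real system would typically use a full stop-word list and an established stemmer such as Porter's algorithm).

```python
# Toy stop-word list (illustrative only, not the list used for PU1)
STOP_WORDS = {"a", "and", "for", "the", "of"}

def naive_stem(word):
    """Crude suffix stripping for illustration; real systems use e.g. Porter's algorithm."""
    for suffix in ("ing", "er", "ed", "s"):
        # Only strip when a reasonably long root remains
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(tokens):
    # Stop-word removal followed by stemming, shrinking the vocabulary
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

tokens = "a learner and the learning of words".split()
print(preprocess(tokens))  # → ['learn', 'learn', 'word']
```

Note how 'learner' and 'learning' collapse to the single token 'learn': this is exactly the dimensionality reduction these procedures aim at, at the cost of conflating word forms.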
4 Pre-processing Stage
Pre-processing is an important step in any pattern recognition or information retrieval task. In this stage, the dataset and the samples within it are turned into patterns the learning system can interpret. Here, we divide this step into the development of a representation for the samples (Section 4.1) and the reduction of the number of attributes (Section 4.2).
4.1 Message Representation
The first stage of designing the representation is to define how the messages will be encoded. Each individual message can be represented as a binary vector denoting which features are present or absent in the message. This is frequently referred to as the bag-of-words approach. A feature in this context is a word, w_i, and each message, x_m, is represented as in Eq. 3, where i is the number of words in the vocabulary of the entire corpus and d is the number of documents (messages) in the dataset.

x_m = (w_m1, w_m2, ..., w_mi),   m = 1, 2, ..., d   (3)
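The encoding of Eq. 3 can be sketched as follows. The vocabulary below is a tiny hypothetical example (the real PU1 vocabulary has 24,748 integer-coded tokens); the point is that each component of x_m records only the presence or absence of the corresponding word.

```python
# Hypothetical vocabulary mapping each word to a unique integer index,
# mirroring the integer encoding used in the PU1 corpus
vocab = {"buy": 0, "cheap": 1, "meeting": 2, "offer": 3, "tomorrow": 4}

def to_binary_vector(message_tokens, vocab):
    """Build x_m = (w_m1, ..., w_mi): component i is 1 iff word i occurs in message m."""
    x = [0] * len(vocab)
    for token in message_tokens:
        if token in vocab:
            x[vocab[token]] = 1  # presence/absence only, not a count
    return x

print(to_binary_vector(["buy", "cheap", "cheap", "offer"], vocab))  # → [1, 1, 0, 1, 0]
```

Note that the repeated token 'cheap' still contributes a single 1: the bag-of-words vector here is binary, discarding word frequency and word order alike.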
2 The PU corpora may be downloaded from http://www.iit.demokritos.gr/skel/i-config/.