3 The Corpus Used
The corpus used to validate the proposal is the PU1 corpus [4]², which consists of 1,099 messages with a spam rate of 43.77%, divided as follows:
- 481 spam messages: all the spam messages received over a period of 22 months, excluding non-English messages and duplicates of spam messages sent on the same day.
- 618 legitimate messages, all in English, received over a period of 36 months.
All messages have header fields and HTML tags removed, leaving only the subject line and body text, resulting in a total vocabulary of 24,748 words. Each token was mapped to a unique integer to ensure the privacy of the content. There are four versions of this dataset: with or without stemming, and with or without stop-word removal. Stop-word removal is a procedure that removes the most frequently used words, such as 'and', 'for', and 'a', while stemming is the process of reducing a word to its root form (e.g., 'learner' becomes 'learn'). These methods are used mainly to reduce the dimensionality of the feature space, aiming to improve the classifier's prediction. However, Androutsopoulos et al. [4] demonstrated that stop-word removal and stemming may not yield a statistically significant improvement. For this reason, in the experiments to be presented we have adopted the version without stemming or stop-word removal, although we have applied a simple dimensionality-reduction procedure to alleviate the data sparseness problem.
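To make the two procedures concrete, the following is a minimal sketch of stop-word removal and stemming. The stop-word list and the suffix-stripping rules are toy illustrations chosen for this example; they are not the actual preprocessing used to build PU1 (a real system would typically use a full stop-word list and an established stemmer such as Porter's algorithm).

```python
# Toy stop-word list (illustrative only, not the list used for PU1)
STOP_WORDS = {"a", "and", "for", "the", "of"}

def naive_stem(word):
    """Crude suffix stripping for illustration; real systems use e.g. Porter's algorithm."""
    for suffix in ("ing", "er", "ed", "s"):
        # Only strip when a reasonably long root remains
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(tokens):
    # Stop-word removal followed by stemming, shrinking the vocabulary
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

tokens = "a learner and the learning of words".split()
print(preprocess(tokens))  # → ['learn', 'learn', 'word']
```

Note how 'learner' and 'learning' collapse to the single token 'learn': this is exactly the dimensionality reduction these procedures aim at, at the cost of conflating word forms.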
4 Pre-processing Stage
Pre-processing is an important step in any pattern recognition or information retrieval task. In this stage, the dataset and the samples within it are turned into patterns the learning system can interpret. Here, we divide this step into the development of a representation for the samples (Section 4.1) and the reduction of the number of attributes (Section 4.2).
4.1 Message Representation
The first stage of designing the representation is to define how the messages will be encoded. Each individual message can be represented as a binary vector denoting which features are present or absent in the message. This is frequently referred to as the bag-of-words approach. A feature in this context is a word, w_i, and each message, x_m, is represented as in Eq. 3, where i is the number of words in the vocabulary of the entire corpus and d is the number of documents (messages) in the dataset.

x_m = (w_m1, w_m2, ..., w_mi),   m = 1, 2, ..., d   (3)
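The encoding of Eq. 3 can be sketched as follows. The vocabulary below is a tiny hypothetical example (the real PU1 vocabulary has 24,748 integer-coded tokens); the point is that each component of x_m records only the presence or absence of the corresponding word.

```python
# Hypothetical vocabulary mapping each word to a unique integer index,
# mirroring the integer encoding used in the PU1 corpus
vocab = {"buy": 0, "cheap": 1, "meeting": 2, "offer": 3, "tomorrow": 4}

def to_binary_vector(message_tokens, vocab):
    """Build x_m = (w_m1, ..., w_mi): component i is 1 iff word i occurs in message m."""
    x = [0] * len(vocab)
    for token in message_tokens:
        if token in vocab:
            x[vocab[token]] = 1  # presence/absence only, not a count
    return x

print(to_binary_vector(["buy", "cheap", "cheap", "offer"], vocab))  # → [1, 1, 0, 1, 0]
```

Note that the repeated token 'cheap' still contributes a single 1: the bag-of-words vector here is binary, discarding word frequency and word order alike.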
2 The PU corpora may be downloaded from http://www.iit.demokritos.gr/skel/i-config/.