To calculate the class distribution of a complete set or a subset, the instance
weights from the training set are summed and then normalised. For a nominal
attribute, the distribution corresponds to the normalised sum of weights for each
distinct value; for a numeric attribute, a binary split based on the median is
computed, the weights are summed for the two resulting bins, and the totals are
finally normalised.
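The weighted-distribution computation described above can be sketched as follows. This is a minimal illustration, not WEKA's actual implementation; instances are assumed to be (label, weight) or (value, weight) pairs.

```python
from statistics import median

def class_distribution(instances):
    """Normalised class distribution of a (sub)set of weighted instances.

    Each instance is a (class_label, weight) pair; weights are summed per
    class and the totals are normalised so they sum to 1.
    """
    totals = {}
    for label, weight in instances:
        totals[label] = totals.get(label, 0.0) + weight
    s = sum(totals.values())
    return {label: w / s for label, w in totals.items()}

def numeric_split_distribution(values_weights):
    """Median-based binary split distribution for a numeric attribute.

    values_weights is a list of (value, weight) pairs; weights are summed
    for the two bins (<= median, > median) and then normalised.
    """
    med = median(v for v, _ in values_weights)
    bins = [0.0, 0.0]
    for v, w in values_weights:
        bins[0 if v <= med else 1] += w
    s = sum(bins)
    return [b / s for b in bins]
```

The same scheme applies to a complete set or any subset: only the list of instances passed in changes.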
2.4 RandomWoods
RandomWoods works like WEKA's classic RandomForest, but whereas Random-
Forest combines Bagging with RandomTree, RandomWoods combines Collective-
Bagging with CollectiveTree. CollectiveBagging is classic Bagging, a machine-
learning ensemble meta-algorithm that improves stability and classification
accuracy, extended so that it can be applied to collective classifiers.
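The bagging scheme underlying both RandomForest and RandomWoods can be sketched generically. This is a hedged illustration of the meta-algorithm only, not WEKA's CollectiveBagging code; `build_tree` stands in for whatever base learner (RandomTree, CollectiveTree) the ensemble wraps.

```python
import random

def bagging_train(train_set, build_tree, n_trees=10, seed=42):
    """Generic bagging: train n_trees base models, each on a bootstrap
    sample drawn with replacement, the same size as the training set."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(train_set) for _ in train_set]
        forest.append(build_tree(sample))
    return forest

def bagging_predict(forest, instance):
    """Classify by majority vote over the ensemble's predictions."""
    votes = {}
    for tree in forest:
        label = tree(instance)
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)
```

Swapping the base learner and the sampling wrapper is exactly the difference the text describes: RandomForest = Bagging + RandomTree, RandomWoods = CollectiveBagging + CollectiveTree.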
3 Empirical Evaluation
To evaluate the collective algorithms we used the Ling Spam 3 and SpamAssas-
sin 4 datasets. Ling Spam consists of a mixture of spam and legitimate
messages retrieved from the Linguistic list, an e-mail distribution list about
linguistics. It comprises 2,893 different e-mails, of which 2,412 are legitimate
e-mails obtained by downloading digests from the list and 481 are spam e-mails
retrieved from one of the authors of the corpus (for a more detailed description
of the corpus please refer to [14,15]). Of the 4 datasets provided in this corpus,
each with different pre-processing steps, we chose the Bare dataset, which has
no pre-processing.
The SpamAssassin public mail corpus is a selection of 1,897 spam messages
and 4,150 legitimate e-mails. Unfortunately, due to computational restrictions
we were obliged to reduce the dataset by 50%, so the final dataset comprises
3,023 e-mails, of which 964 are spam e-mails and 2,059 are legitimate messages.
In addition, for both datasets we performed Stop Word Removal [16] based
on an external stop-word list 5 and removed any non-alphanumeric characters.
We then used the Vector Space Model (VSM) [17], an algebraic approach for
Information Filtering (IF), Information Retrieval (IR), indexing and ranking,
to create the model. This model represents natural-language documents
mathematically as vectors in a multidimensional space.
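The preprocessing and VSM steps just described can be sketched as follows. This is a minimal, hedged illustration: the tiny `STOP_WORDS` set stands in for the external stop-word list cited above, and the vectors are plain term-frequency counts over a fixed vocabulary.

```python
import re
from collections import Counter

# Stand-in for the external stop-word list referenced in the text.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}

def preprocess(text):
    """Lowercase, keep only alphanumeric tokens, and remove stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def to_vector(tokens, vocabulary):
    """Vector Space Model: represent a document as a term-frequency
    vector with one dimension per vocabulary term."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]
```

Each e-mail thus becomes a point in a space whose dimensions are the retained attributes, which is the representation the attribute-selection step below operates on.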
We extracted the top 1,000 attributes using Information Gain [18], an algo-
rithm that evaluates the relevance of an attribute by measuring the information
gain with respect to the class: IG(j) = Σ_{v_j ∈ R} Σ_{C_i} P(v_j, C_i) ·
log( P(v_j, C_i) / (P(v_j) · P(C_i)) ), where C_i is the i-th class, v_j is the
value of the j-th attribute, and P(v_j, C_i) is the probability that the j-th
attribute has the value v_j in the class C_i.
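The Information Gain score above can be estimated from joint counts of attribute values and class labels. A minimal sketch, assuming binary (presence/absence) attribute values and maximum-likelihood probability estimates, not the actual WEKA attribute evaluator:

```python
from math import log2

def information_gain(pairs):
    """IG(j) = sum_v sum_c P(v, c) * log2(P(v, c) / (P(v) * P(c))),
    estimated from a list of (attribute_value, class_label) observations."""
    n = len(pairs)
    joint, val_counts, cls_counts = {}, {}, {}
    for v, c in pairs:
        joint[(v, c)] = joint.get((v, c), 0) + 1
        val_counts[v] = val_counts.get(v, 0) + 1
        cls_counts[c] = cls_counts.get(c, 0) + 1
    ig = 0.0
    for (v, c), count in joint.items():
        p_vc = count / n
        # p_vc / (P(v) * P(c)) with P(v) = val_counts[v]/n, P(c) = cls_counts[c]/n
        ig += p_vc * log2(count * n / (val_counts[v] * cls_counts[c]))
    return ig
```

Scoring each of the candidate attributes this way and keeping the 1,000 highest-scoring ones corresponds to the selection step described in the text.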
3 http://nlp.cs.aueb.gr/software_and_datasets/lingspam_public.tar.gz
4 http://spamassassin.apache.org/publiccorpus/
5 http://www.webconfs.com/stop-words.php