and Random Woods behave similarly, with very poor recall: with 90% of known
instances they reach maximums of 0.16 and 0.20, respectively, for Ling Spam and
0.68 and 0.66 for SpamAssassin.
Finally, Fig. 3 shows the Area under the ROC curve (AUC) of the different
algorithms. Once more, the performance of Collective KNN increases with more
known instances: from 0.64 with 10% to 0.87 with 90% for Ling Spam and
from 0.56 to 0.90 for SpamAssassin. Collective Forest offers a perfect 1.00 for
every configuration with Ling Spam and a minimum of 0.99 with SpamAssassin,
which makes it a suitable choice for collective classification. Finally,
Collective Woods and Random Woods offer similar results: both start at 0.86 and
rise to 0.92 and 0.90, respectively, with Ling Spam, and they rise from 0.93 and
0.92 to 0.94 for both with SpamAssassin.
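As a reminder of what the AUC figures measure, the following minimal sketch
computes AUC as the probability that a randomly chosen spam message receives a
higher score than a randomly chosen legitimate one (ties count as one half).
The function name and the toy labels and scores are illustrative assumptions;
they are not the evaluation code or the data used in this study.

```python
def auc_from_scores(labels, scores):
    """Rank-based AUC: fraction of (spam, legitimate) pairs ranked correctly.

    labels: 1 for spam, 0 for legitimate (hypothetical encoding).
    scores: classifier confidence that each message is spam.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one message of each class")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # spam correctly ranked above legitimate
            elif p == n:
                wins += 0.5   # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Toy example (made-up values): 3 of the 4 spam/legitimate pairs are ordered correctly.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auc = auc_from_scores(toy_labels, toy_scores)
```

An AUC of 1.00, as reported for Collective Forest with Ling Spam, corresponds
to every spam message being scored above every legitimate one.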
4 Discussion and Concluding Remarks
Collective classification algorithms are a suitable approach to spam filtering:
they optimise the classification of partially-labelled data and, therefore, help
to cope with the amount of unclassified spam e-mails that are created every day.
In particular, Collective Forest shows great results for every configuration of
known instances (i.e., different sizes for the set of known instances), with
values above 0.93 of precision, above 0.90 of recall (only offering a poor recall of
0.78 with 10% of X) and almost 1.00 of AUC for all configurations.
Since precision and AUC are only slightly affected by the variation of known
instances (the values of X), recall should be the factor to take into account
when determining the optimal size of labelled data, assuming that Collective
Forest is the chosen algorithm. For a value of X = 60%, Collective Forest
achieves its maximums, only experiencing a loss of 0.03 of recall for Ling Spam.
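The selection criterion described above can be sketched as follows: given the
recall measured at each percentage of known instances, choose the smallest
percentage whose recall stays within a tolerated loss of the best recall. The
function name, the tolerance parameter and the recall values below are
hypothetical placeholders for illustration; they are not the measurements
reported in this paper.

```python
def smallest_sufficient_percentage(recall_by_percentage, tolerated_loss):
    """Return the smallest labelled percentage whose recall is within
    `tolerated_loss` of the best recall observed over all percentages."""
    best_recall = max(recall_by_percentage.values())
    for percentage in sorted(recall_by_percentage):
        loss = best_recall - recall_by_percentage[percentage]
        # Small epsilon guards against floating-point rounding at the boundary.
        if loss <= tolerated_loss + 1e-9:
            return percentage
    # Not reached for non-empty input: the best percentage always qualifies.
    raise ValueError("recall_by_percentage must not be empty")


# Hypothetical recall values per percentage of known instances (made up).
example_recall = {10: 0.70, 30: 0.85, 60: 0.93, 90: 0.95}
chosen = smallest_sufficient_percentage(example_recall, tolerated_loss=0.03)
```

With these placeholder values the sketch selects 60, because 0.93 is within
0.03 of the best recall (0.95) while the smaller percentages are not.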
As the number of unsolicited bulk messages increases, the classification and
labelling steps on which supervised methods commonly rely become increasingly
unattainable. To address this situation, we propose the first spam filtering system
that uses collective classification to optimise classification performance. With
the algorithms introduced, the need for labelled e-mails is reduced by 40%
without a significant penalty in detection capability.
Future work will focus on three main directions. First, we plan to extend
our study of collective classification by applying more algorithms to the spam
problem. Second, we will select different features as data to train the models.
Finally, we will perform a more complete analysis of the effects of the degree
to which the data are labelled.