Collective Classification for Spam Filtering
Carlos Laorden, Borja Sanz, Igor Santos, Patxi Galán-García,
and Pablo G. Bringas
DeustoTech Computing - S3Lab, University of Deusto
Avenida de las Universidades 24, 48007 Bilbao, Spain
{claorden,borja.sanz,isantos,patxigg,pablo.garcia.bringas}@deusto.es
Abstract. Spam has become a major issue in computer security because
it is a channel for threats such as computer viruses, worms and
phishing. Many solutions feature machine-learning algorithms trained
on statistical representations of the terms that usually appear in
e-mails. Still, these methods require a training step with labelled
data. When few labelled training instances are available, the progress
of filtering systems slows down, which gives spammers an advantage.
Currently, many approaches direct their efforts towards Semi-Supervised
Learning (SSL). SSL lies halfway between supervised and unsupervised
learning: in addition to unlabelled data, it receives some supervision
information, such as the target labels of some of the examples.
Collective Classification for Text Classification is an interesting
method for optimising the classification of partially-labelled data.
We therefore propose here, for the first time, Collective
Classification algorithms for spam filtering, to cope with the large
number of unclassified e-mails that are sent every day.
Keywords: Spam filtering, collective classification, semi-supervised
learning.
1 Introduction
Flooding inboxes with annoying and time-consuming messages, spam accounts
for more than 85% of received e-mails¹.
Several approaches have been proposed by the academic community to solve
the spam problem [1,2,3,4]. Among them, the so-called statistical approaches
[5] use machine-learning techniques to classify e-mails. These approaches have
proved their efficiency in detecting spam and are the most widespread technique
for fighting it. In particular, Bayes' theorem is widely used by anti-spam
filters (e.g., SpamAssassin [6], Bogofilter [7], and Spamprobe [8]).
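As a rough illustration of how Bayes' theorem can drive a statistical filter, the following Python sketch trains a word-count naive Bayes model and estimates the probability that a message is spam. It is a minimal sketch under stated assumptions, not the implementation used by any of the filters named above: the whitespace tokenisation, the Laplace smoothing, the function names `train` and `spam_probability`, and the toy training messages are all hypothetical choices made for the example.

```python
import math
from collections import Counter

LABELS = ("spam", "ham")


def train(messages):
    """Count words per class from (text, label) pairs.

    Assumes every label is "spam" or "ham" and that both classes
    appear at least once in the training data.
    """
    word_counts = {label: Counter() for label in LABELS}
    doc_counts = Counter()
    for text, label in messages:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab


def spam_probability(text, model):
    """Estimate P(spam | text) with Bayes' theorem and a naive
    independence assumption between words."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label in LABELS:
        # Log prior: fraction of training messages with this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen during training
            # Laplace-smoothed likelihood P(word | label).
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        log_scores[label] = score
    # Normalise the two log scores into a probability.
    top = max(log_scores.values())
    weights = {k: math.exp(v - top) for k, v in log_scores.items()}
    return weights["spam"] / (weights["spam"] + weights["ham"])


# Hypothetical toy training set, for illustration only.
TRAINING = [
    ("cheap pills buy now", "spam"),
    ("win money now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]

if __name__ == "__main__":
    model = train(TRAINING)
    print(spam_probability("buy cheap pills", model))
    print(spam_probability("monday meeting", model))
```

With this toy data, the first message scores well above 0.5 and the second well below it. The sketch also makes the dependence on labelled data visible: every count the model uses comes from a message that someone has already marked as spam or ham.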
These statistical approaches are usually supervised, i.e., they need a training
set of previously labelled samples. These techniques perform better as more
training instances become available. This means that a significant amount of
prior labelling work is needed to increase the accuracy of the models. This work
includes a gathering phase in which as many e-mails as possible are collected. Then,
 