Collective Classification for Spam Filtering - Computational Intelligence in Security for Information Systems

Collective Classification for Spam Filtering

Carlos Laorden, Borja Sanz, Igor Santos, Patxi Galán-García, and Pablo G. Bringas

DeustoTech Computing - S3Lab, University of Deusto

Avenida de las Universidades 24, 48007 Bilbao, Spain

{claorden,borja.sanz,isantos,patxigg,pablo.garcia.bringas}@deusto.es

Abstract. Spam has become a major issue in computer security because it is a channel for threats such as computer viruses, worms and phishing. Many solutions feature machine-learning algorithms trained using statistical representations of the terms that usually appear in e-mails. Still, these methods require a training step with labelled data. A limited supply of labelled training instances slows down the progress of filtering systems and gives spammers an advantage. Currently, many approaches direct their efforts into Semi-Supervised Learning (SSL). SSL is a halfway method between supervised and unsupervised learning which, in addition to unlabelled data, receives some supervision information, such as the association of the targets with some of the examples. Collective Classification for Text Classification is an interesting method for optimising the classification of partially-labelled data. We therefore propose here, for the first time, Collective Classification algorithms for spam filtering to cope with the large amount of unclassified e-mails that are sent every day.

Keywords: Spam filtering, collective classification, semi-supervised learning.

1 Introduction

Spam floods inboxes with annoying and time-consuming messages: more than 85% of received e-mails are spam¹.

Several approaches have been proposed by the academic community to solve the spam problem [1,2,3,4]. Among them, the so-called statistical approaches [5] use machine-learning techniques to classify e-mails. These approaches have proved their efficiency in detecting spam and are the most widespread technique for fighting it. In particular, Bayes' theorem is widely used by anti-spam filters (e.g., SpamAssassin [6], Bogofilter [7], and SpamProbe [8]).
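To make the Bayesian idea concrete, the following is a minimal sketch of a naive Bayes text classifier with Laplace smoothing. It is an illustration only and not the implementation of any of the filters cited above: the whitespace tokeniser, the function names (tokenize, train, classify) and the four toy e-mails are all assumptions made for this example.

```python
import math
from collections import Counter

# Hypothetical toy corpus; a real filter would need far more labelled e-mails.
TOY_EMAILS = [
    ("cheap pills buy now", "spam"),
    ("win money now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]


def tokenize(text):
    """Assumed tokeniser: lower-case the text and split on whitespace."""
    return text.lower().split()


def train(labelled_emails):
    """Count documents and term occurrences for each class."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in labelled_emails:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the class with the highest posterior log-score.

    Assumes both classes appear at least once in the training data.
    """
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        # Log prior: fraction of training e-mails with this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore terms never seen during training
            # Laplace-smoothed log likelihood of the term given the class.
            score += math.log(
                (word_counts[label][token] + 1)
                / (total_words + len(vocabulary))
            )
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TOY_EMAILS)
print(classify("buy cheap money", model))
print(classify("agenda for lunch", model))
```

Every probability in this sketch is estimated from labelled e-mails, so the quality of the model depends directly on how many labelled examples are available.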

These statistical approaches are usually supervised, i.e., they need a training set of previously labelled samples. These techniques perform better as more training instances become available, which means that a significant amount of prior labelling work is needed to increase the accuracy of the models. This work includes a gathering phase in which as many e-mails as possible are collected. Then,
