3.4 Anti-spam filtering
E-mail is a computer-based service that adapts traditional postal delivery to computer
networks and the Internet. Nowadays, e-mail addresses appear on every business card
alongside other relevant contact details such as the postal address or the phone number.
However, for more than a decade the use of e-mail has been bedeviled by spamming, to the
point that spam is beginning to undermine the integrity of e-mail and even to discourage
its use.
In this context, spam is a term used to designate all forms of unsolicited commercial e-mail
and can be formally defined as an electronic message satisfying the following two
conditions: (i) the recipient's personal identity and context are irrelevant because the
message is equally applicable to many other potential recipients and (ii) the recipient has not
verifiably granted deliberate, explicit, and still-revocable permission for it to be sent
(SpamHaus, 1998).
Because of attractive characteristics of e-mail such as low cost and fast delivery, it has
become the main distribution channel for spam. Every day e-mail users receive many
messages offering illegal drugs, replica Swiss watches, fake jobs, forged university
diplomas, etc. This situation has led to a progressive increase in the share of spam in
global e-mail traffic. In September 2010, spam accounted for about 92 percent of all
Internet e-mail traffic (MessageLabs, 2010).
In order to fight spam successfully (ideally, to eliminate it), both theoretical and
applied research on spam filtering is fundamental. Much valuable research has already
been carried out (Guzella & Caminhas, 2009) and relevant conferences have emerged in the
field (CEAS, 2010). Moreover, the software industry has released several commercial
products to a large number of end users with the goal of minimizing the drawbacks of
spam.
With the goal of providing an effective solution, we present the SpamHunting system
(Fdez-Riverola et al., 2007), an instance-based reasoning (IBR) e-mail filtering model that
outperforms classical machine learning techniques and other successful lazy learning
approaches in the domain of anti-spam filtering. The architecture of the decision support
filter is based on a tuneable enhanced instance retrieval network able to accurately
generalize e-mail representations. The reuse of similar messages is carried out by a simple
unanimous voting mechanism that determines whether the target case is spam or not.
Before the system gives its final response, a revision stage is performed only when the
assigned class is spam; in this stage the system employs general knowledge in the form of
meta-rules.
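The reuse and revision steps described above can be sketched as follows. This is a minimal illustration and not the published SpamHunting implementation. It assumes that a unanimous vote means the target is labelled spam only when every retrieved instance is spam, and that a meta-rule is a predicate able to overturn a spam decision. The names unanimous_vote, classify and meta_rules are hypothetical.

```python
from typing import Callable, Iterable, Sequence

SPAM = "spam"
LEGITIMATE = "legitimate"


def unanimous_vote(retrieved_labels: Iterable[str]) -> str:
    """Propose 'spam' only if every retrieved instance is spam.

    Assumption: any disagreement, or an empty retrieval, falls back to
    'legitimate' so that doubtful cases are not filtered.
    """
    labels = list(retrieved_labels)
    if labels and all(label == SPAM for label in labels):
        return SPAM
    return LEGITIMATE


def classify(
    retrieved_labels: Iterable[str],
    message: dict,
    meta_rules: Sequence[Callable[[dict], bool]],
) -> str:
    """Reuse step (voting) followed by a revision step for spam only."""
    proposed = unanimous_vote(retrieved_labels)
    if proposed != SPAM:
        # Revision is skipped when the proposed class is not spam.
        return proposed
    # Assumption: a meta-rule returning True vetoes the spam label.
    if any(rule(message) for rule in meta_rules):
        return LEGITIMATE
    return SPAM
```

In this sketch the revision step can only turn a spam decision into a legitimate one, which mirrors the statement that revision is performed only when the assigned class is spam.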
In order to correctly represent incoming e-mails, a message descriptor (instance) is
generated and stored in the e-mail base of the SpamHunting system. This message
descriptor contains the sequence of features that best summarize the information
contained in the e-mail. For this purpose, we use data from two main sources: (i)
information obtained from the header of the e-mail and (ii) those terms that are most
representative of the subject, body and attachments of the message. Table 4 summarizes the
structure of each instance stored in the SpamHunting e-mail base.
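As a rough illustration of how such a descriptor could be assembled from the two sources, consider the sketch below. The field names are hypothetical and do not reproduce Table 4, attachments are ignored, and raw term frequency is used only as a stand-in for whatever measure of representativeness the system actually applies.

```python
import re
from collections import Counter
from dataclasses import dataclass
from email import message_from_string
from typing import List


@dataclass
class MessageDescriptor:
    # Hypothetical fields; the real instance structure is given in Table 4.
    sender: str
    subject: str
    content_type: str
    terms: List[str]


def build_descriptor(raw: str, max_terms: int = 5) -> MessageDescriptor:
    """Build a descriptor from header fields and frequent terms."""
    msg = message_from_string(raw)

    # Collect the plain-text parts of the body (attachments are skipped).
    body_parts = []
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            if payload is not None:
                body_parts.append(payload.decode("utf-8", errors="replace"))

    subject = str(msg.get("Subject", ""))
    text = (subject + " " + " ".join(body_parts)).lower()

    # Assumption: words of three or more letters, ranked by frequency.
    tokens = re.findall(r"[a-z]{3,}", text)
    terms = [term for term, _count in Counter(tokens).most_common(max_terms)]

    return MessageDescriptor(
        sender=str(msg.get("From", "")),
        subject=subject,
        content_type=msg.get_content_type(),
        terms=terms,
    )
```

A call such as build_descriptor(raw_message, max_terms=10) returns one instance that could then be stored in an e-mail base.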
Figure 10 illustrates the life cycle of the IBR SpamHunting system as well as its integration
within a typical user environment. In the upper part of Figure 10, the mail user agent
(MUA) and the mail transfer agent (MTA) are in charge of dispatching the requests
generated by the user. Between these two applications, SpamHunting captures all the
incoming messages (using the POP3 protocol) in order to identify, tag and filter spam.
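The capture-and-tag step can be sketched with Python's standard poplib and email modules. This is only an illustration of the idea, not the way SpamHunting is actually integrated: the header name X-Spam-Tag, the functions fetch_messages, tag_message and capture, and the is_spam callback are all hypothetical.

```python
import poplib
from email import message_from_bytes
from email.message import Message
from typing import Callable, Iterator, List


def fetch_messages(conn: poplib.POP3) -> Iterator[Message]:
    """Yield every message currently held in the POP3 mailbox."""
    count, _mailbox_size = conn.stat()
    for number in range(1, count + 1):  # POP3 numbers messages from 1
        _response, lines, _octets = conn.retr(number)
        yield message_from_bytes(b"\r\n".join(lines))


def tag_message(msg: Message, is_spam: Callable[[Message], bool]) -> Message:
    """Record the filter's decision in a (hypothetical) header."""
    msg["X-Spam-Tag"] = "spam" if is_spam(msg) else "legitimate"
    return msg


def capture(conn: poplib.POP3, is_spam: Callable[[Message], bool]) -> List[Message]:
    """Fetch all incoming messages and tag each one."""
    return [tag_message(msg, is_spam) for msg in fetch_messages(conn)]
```

With a real mailbox, conn would be a poplib.POP3 or poplib.POP3_SSL object that has already authenticated through its user() and pass_() methods, and is_spam would wrap the classifier. A mail client could then filter on the added header.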