An Immunological Filter for Spam - Artificial Immune Systems


4.2 Dimensionality Reduction

When dealing with textual information, the feature space tends to be large, usually on the order of several thousand attributes (words). Hence, a method to reduce the number of attributes is required. According to [20], attributes that appear in most of the files are not useful for separating the documents, because all classes have instances that contain those attributes. In addition, as we are working with only two classes (spam and legitimate), words that appear rarely in the files carry little weight in identifying the class. Therefore, attributes that appear in fewer than 5% or in more than 95% of the documents in the corpus were removed. After this step, the dimension of the feature vectors is 751. The benefits of dimensionality reduction also include, in some cases, an improvement in prediction accuracy [21].
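
The document-frequency filter described above can be illustrated with a minimal sketch. It assumes each message has already been tokenized into a list of words; the function name `filter_by_document_frequency`, the toy corpus, and the thresholds used in the example call are hypothetical and are not taken from the original experiments.

```python
from collections import Counter


def filter_by_document_frequency(documents, low=0.05, high=0.95):
    """Return the words whose document frequency lies in [low, high].

    `documents` is a list of token lists. Words occurring in fewer than
    `low` or in more than `high` of the documents are discarded.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        # Count each word at most once per document.
        doc_freq.update(set(tokens))
    return sorted(
        word for word, count in doc_freq.items()
        if low <= count / n_docs <= high
    )


# Toy corpus (hypothetical data). The lower threshold is raised to 0.3
# only because a four-document corpus is too small for a 5% cut-off.
corpus = [
    ["free", "money", "the"],
    ["meeting", "the", "agenda"],
    ["free", "offer", "the"],
    ["the", "report", "agenda"],
]
vocabulary = filter_by_document_frequency(corpus, low=0.3, high=0.95)
print(vocabulary)
```

In this toy run, "the" is dropped because it occurs in every document, and the words that occur in a single document are dropped as too rare.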

5 Performance Measures

Once a classifier has been generated, it is necessary to obtain some indexes that measure its performance and facilitate comparison with other classifiers. In pattern recognition and information retrieval, when there are multiple categories, performance measures such as recall and precision are used. Although spam detection is a binary classification task, these measures will be used here to estimate the accuracy of the methods.

We adopt the same notation used in [4,22], using L and S to represent legitimate and spam messages, respectively, and $n_{L \to S}$ (legitimate classified as spam, or false positive) and $n_{S \to L}$ (spam classified as legitimate, or false negative) to denote the two error types. The spam recall (SR) and the spam precision (SP) are then defined in equations 4 and 5.

$$ SR = \frac{n_{S \to S}}{n_{S \to S} + n_{S \to L}} \tag{4} $$

$$ SP = \frac{n_{S \to S}}{n_{S \to S} + n_{L \to S}} \tag{5} $$
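
As a small worked illustration of equations 4 and 5, the sketch below computes spam recall and spam precision from raw counts. The function names and the counts are hypothetical and chosen only for the example.

```python
def spam_recall(n_ss, n_sl):
    """Equation 4: fraction of spam messages that the filter catches."""
    return n_ss / (n_ss + n_sl)


def spam_precision(n_ss, n_ls):
    """Equation 5: fraction of messages flagged as spam that are spam."""
    return n_ss / (n_ss + n_ls)


# Hypothetical counts: 90 spam caught, 10 spam missed,
# 5 legitimate messages wrongly flagged as spam.
n_ss, n_sl, n_ls = 90, 10, 5
sr = spam_recall(n_ss, n_sl)
sp = spam_precision(n_ss, n_ls)
print(f"SR = {sr:.3f}, SP = {sp:.3f}")
```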

In anti-spam filtering, misclassifying a legitimate mail as spam is worse than letting a spam message pass the filter. If a spam message gets through the filter, the only inconvenience it causes is the time wasted removing it from the inbox. However, if an important legitimate message is misclassified, the consequences can be disastrous. When the two error types (false positive and false negative) differ in importance, the usual precision and recall measures cannot express performance well, and cost-sensitive evaluation measures are needed.

Androutsopoulos et al. [4] introduced a weighted accuracy measure (WAcc) that assigns a higher cost to false positives than to false negatives and that has been used in some spam filtering benchmarks [4,8,22]. WAcc and the corresponding weighted error rate (WErr) are defined as:

$$ WAcc = \frac{\lambda \cdot n_{L \to L} + n_{S \to S}}{\lambda \cdot N_L + N_S}, \qquad WErr = \frac{\lambda \cdot n_{L \to S} + n_{S \to L}}{\lambda \cdot N_L + N_S} \tag{6} $$
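
The weighted measures in equation 6 can be sketched as follows. This is an illustration under stated assumptions: $N_L$ and $N_S$ are taken to be the total numbers of legitimate and spam messages (so $N_L = n_{L \to L} + n_{L \to S}$ and $N_S = n_{S \to S} + n_{S \to L}$), and the counts and the value of λ are hypothetical.

```python
def weighted_accuracy(n_ll, n_ls, n_ss, n_sl, lam):
    """WAcc from equation 6; each legitimate message counts lam times."""
    n_legit = n_ll + n_ls  # assumed meaning of N_L
    n_spam = n_ss + n_sl   # assumed meaning of N_S
    return (lam * n_ll + n_ss) / (lam * n_legit + n_spam)


def weighted_error(n_ll, n_ls, n_ss, n_sl, lam):
    """WErr from equation 6, the weighted share of misclassifications."""
    n_legit = n_ll + n_ls
    n_spam = n_ss + n_sl
    return (lam * n_ls + n_sl) / (lam * n_legit + n_spam)


# Hypothetical confusion counts and an arbitrary illustrative weight.
n_ll, n_ls, n_ss, n_sl = 195, 5, 90, 10
lam = 9
wacc = weighted_accuracy(n_ll, n_ls, n_ss, n_sl, lam)
werr = weighted_error(n_ll, n_ls, n_ss, n_sl, lam)
print(f"WAcc = {wacc:.4f}, WErr = {werr:.4f}")
```

Under these assumptions the two numerators sum to the common denominator, so WAcc and WErr add up to 1. With λ = 1 the measure reduces to ordinary accuracy.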
