malicious, the size of the body, etc. Hot-listed “dirty words” and n-gram models, together with their frequencies of occurrence, are among the content-based linguistic features supported for email messages. We describe these next.
EMT is also being extended to include the profiles of the sender and recipient
email accounts, and their clique behavior, as features for the supervised learning
component.
3.2 Content-Based Classification of Emails
In addition to using flow statistics to classify an email message, we also use the
email body as a content-based feature. We have explored two choices of features
extracted from the contents of the email: one is the n-gram model [16], and the
other is the frequency of occurrence of a set of words [17].
An n-gram represents a sequence of any n adjacent characters or tokens that appear
in a document. We pass an n-character-wide window through the entire email body,
one character at a time, and count the number of occurrences of each n-gram. The
result, for a single email, is a hash table that uses the n-gram as a key and the
number of occurrences as the value; this may be called a document vector.
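The sliding-window counting described above can be sketched in a few lines; the function name here is illustrative, not part of EMT:

```python
from collections import Counter

def ngram_vector(text: str, n: int = 3) -> Counter:
    """Slide an n-character window over the text, one character at a
    time, counting occurrences of each n-gram (the document vector)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

vec = ngram_vector("free money now", n=3)
# vec["fre"] == 1, vec["mon"] == 1
```

A `Counter` serves as the hash table: keys are n-grams, values are occurrence counts.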
Given a set of training emails, we use the arithmetic average of the document vec-
tors as the centroid for that set. For any test email, we compute the cosine distance
[16] against the centroid created for the training set. If the cosine distance is 1, then
the two documents are deemed identical. The smaller the value of the cosine distance,
the more different the two documents are.
The formula for the cosine distance is:

D(x, y) = \sum_{j=1}^{J} x_j y_j \Bigg/ \left( \sum_{j=1}^{J} x_j^2 \; \sum_{k=1}^{J} y_k^2 \right)^{1/2} = \cos\theta   (11)
Here J is the total number of possible n-grams appearing in the training set and
the test email, x is the document vector for a test email, and y is the centroid for
the training set. x_j represents the frequency of the j-th n-gram (the n-grams can
be sorted uniquely) occurring in the test email. Similarly, y_k represents the
frequency of the k-th n-gram of the centroid.
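The centroid averaging and the cosine computation of Eq. (11) can be sketched as follows; the helper names `centroid` and `cosine` are ours, and vectors are dictionaries mapping n-grams to counts:

```python
import math
from collections import Counter

def centroid(vectors):
    """Arithmetic average of the training document vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {g: c / len(vectors) for g, c in total.items()}

def cosine(x, y):
    """D(x, y) of Eq. (11): dot product divided by the product of norms."""
    dot = sum(x[g] * y.get(g, 0.0) for g in x)
    norm = math.sqrt(sum(v * v for v in x.values())) * \
           math.sqrt(sum(v * v for v in y.values()))
    return dot / norm if norm else 0.0
```

A value of 1 indicates identical direction (deemed identical documents); smaller values indicate greater difference, as described above.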
A similar approach is used for the words in the documents instead of the n-grams.
The classification is based on Naïve Bayes learning [17]. Given labeled training data
and some test cases, we compute the likelihood that the test case is a member of each
class. We then assign the test email to the most likely class.
These content-based methods are integrated into the machine learning models for
classifying sets of emails for further inspection and analysis.
Using a set of normal email and spam we collected, we ran some initial experiments.
We used half of the labeled emails, both normal and spam, as the training set, and
the other half as the test set. The accuracy of classification using n-grams and
word tokens varies from 70% to 94%, depending on which parts are used as the
training and test sets.
Detecting spam from content alone is challenging because some spam closely
resembles normal email. To improve accuracy, we also use weighted-keyword and
stopword techniques. For example, the spam messages also contain