malicious, the size of the body, etc. Hot-listed “dirty words” and n-gram models, together with their frequencies of occurrence, are among the content-based linguistic features supported for email messages. We describe these next.
EMT is also being extended to include the profiles of the sender and recipient
email accounts, and their clique behavior, as features for the supervised learning
component.
3.2 Content-Based Classification of Emails
In addition to using flow statistics to classify an email message, we also use the
email body as a content-based feature. We have explored two choices of features
extracted from the contents of the email: one is the n-gram model [16], and the
other is the frequency of occurrence of a set of words [17].
An n-gram represents a sequence of any n adjacent characters or tokens that appear
in a document. We pass an n-character-wide window through the entire email body,
one character at a time, and count the number of occurrences of each n-gram. The
result, for a single email, is a hash table that uses the n-gram as a key and the
number of occurrences as the value; this may be called a document vector.
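The sliding-window counting described above can be sketched in a few lines; the function name here is illustrative, not part of EMT:

```python
from collections import Counter

def ngram_vector(text: str, n: int = 3) -> Counter:
    """Slide an n-character window over the text, one character at a
    time, counting occurrences of each n-gram (the document vector)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

vec = ngram_vector("free money now", n=3)
# vec["fre"] == 1, vec["mon"] == 1
```

A `Counter` serves as the hash table: keys are n-grams, values are occurrence counts.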
Given a set of training emails, we use the arithmetic average of the document vec-
tors as the centroid for that set. For any test email, we compute the cosine distance
[16] against the centroid created for the training set. If the cosine distance is 1, then
the two documents are deemed identical. The smaller the value of the cosine distance,
the more different the two documents are.
The formula for the cosine distance is:

D(x, y) = \sum_{j=1}^{J} x_j y_j \Bigg/ \left( \sum_{j=1}^{J} x_j^2 \; \sum_{k=1}^{J} y_k^2 \right)^{1/2} = \cos\theta   (11)
Here J is the total number of possible n-grams appearing in the training set and
the test email, x is the document vector for a test email, and y is the centroid for
the training set. x_j represents the frequency of the j-th n-gram (the n-grams can
be sorted uniquely) occurring in the test email. Similarly, y_k represents the
frequency of the k-th n-gram of the centroid.
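The centroid averaging and the cosine computation of Eq. (11) can be sketched as follows; the helper names `centroid` and `cosine` are ours, and vectors are dictionaries mapping n-grams to counts:

```python
import math
from collections import Counter

def centroid(vectors):
    """Arithmetic average of the training document vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {g: c / len(vectors) for g, c in total.items()}

def cosine(x, y):
    """D(x, y) of Eq. (11): dot product divided by the product of norms."""
    dot = sum(x[g] * y.get(g, 0.0) for g in x)
    norm = math.sqrt(sum(v * v for v in x.values())) * \
           math.sqrt(sum(v * v for v in y.values()))
    return dot / norm if norm else 0.0
```

A value of 1 indicates identical direction (deemed identical documents); smaller values indicate greater difference, as described above.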
A similar approach is used for the words in the documents instead of the n-grams.
The classification is based on Naïve Bayes learning [17]. Given labeled training data
and some test cases, we compute the likelihood that the test case is a member of each
class. We then assign the test email to the most likely class.
These content-based methods are integrated into the machine learning models for
classifying sets of emails for further inspection and analysis.
Using a set of normal email and spam we collected, we ran some initial experiments.
We used half of the labeled emails, both normal and spam, as the training set, and
the other half as the test set. The accuracy of classification using n-grams and
word tokens varies from 70% to 94%, depending on which parts are used as the
training and test sets.
Detecting spam from content alone is challenging because some spam closely
resembles normal email. To improve accuracy, we also use weighted-keyword and
stopword techniques. For example, the spam messages also contain