4.1 Nigerian Fraud Emails
We acquired 542 different Nigerian Fraud Emails from an internet archive
[26]. We wish to cluster these emails in order to determine any commonality
in the authorship of the texts.
In the following experiment, we choose
{bank, account, money, fund, business, transaction}
as the keyword set. Consider two emails: 2001-10-11.html and
2002-08-27.html (Figure 5).
Fig. 5 Emails: 2001-10-11.html and 2002-08-27.html
The similarity between these two emails is 1 via the KF method and
0.999992 via the KFP method. Reading both emails shows that they are
almost the same. For these two emails, both algorithms provide a proper
estimate of their similarity.
This does not hold in general: the following example shows a “false
positive” output by the KF method. Consider the pair of emails
2002-02-20a.html and 2002-07-04b.html (Figure 6).
Inspection of the documents clearly shows that they are written in very
different styles. The similarities estimated by the KF and KFP methods are
1 and 0.43177, respectively. Evidently, the KF method cannot distinguish
these two emails, while the similarity estimated by the KFP method is much
more reasonable.
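
This section does not restate how the KF similarity is computed. As a rough
illustration only, the sketch below assumes a frequency-only measure: each
email is reduced to a vector of keyword counts, and two vectors are compared
by cosine similarity. The function names, the tokenization, and the two toy
texts are hypothetical and are not taken from the archive or from the
authors' implementation. Under that assumption, two differently worded
texts that use the keywords in the same proportions score close to 1, which
is the kind of false positive described above.

```python
import math
import re
from collections import Counter

# Keyword set used in the experiment above.
KEYWORDS = ["bank", "account", "money", "fund", "business", "transaction"]


def keyword_vector(text, keywords=KEYWORDS):
    """Count exact whole-word occurrences of each keyword (case-insensitive).

    Assumption: no stemming, so "funds" does not count as "fund".
    """
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    return [tokens[k] for k in keywords]


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Two hypothetical texts with different wording but identical keyword counts.
doc_a = "Transfer the money to my bank account."
doc_b = "My account, dear friend, is with a bank that holds no money of mine yet."

sim = cosine_similarity(keyword_vector(doc_a), keyword_vector(doc_b))
print(sim)  # close to 1, although the wording differs
```

A measure of this kind ignores where and how the keywords are used, so it
cannot separate texts that differ only in style.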
 