$$p(\text{word}) = p(\text{word} \mid \text{spam})\, p(\text{spam}) + p(\text{word} \mid \text{ham})\, p(\text{ham})$$
In other words, we've boiled it down to a counting exercise: $p(\text{spam})$
counts spam emails versus all emails, $p(\text{word} \mid \text{spam})$ counts the
prevalence of those spam emails that contain “word,” and $p(\text{word} \mid \text{ham})$
counts the prevalence of the ham emails that contain “word.”
To do this yourself, go online and download the Enron emails. Let's build
a spam filter on that dataset. This really means we're building a
new spam filter on top of the spam filter that existed for the employees
of Enron. We'll use their definition of spam to train our spam filter.
(This does mean that if the spammers have learned anything since
2001, we're out of luck.)
We could write a quick-and-dirty shell script in bash that runs this,
which Jake did. It downloads and unzips the file and creates a folder;
each text file is an email; spam and ham go in separate folders.
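With that layout, the size of each class is simply the number of files in its folder. Here is a minimal sketch, assuming the unpacked folder is named enron1 with spam and ham subfolders of .txt files. The helper name count_emails is ours for illustration and is not taken from Jake's script.

```shell
# count_emails DIR: print how many .txt files (emails) sit directly in DIR.
# Hypothetical helper; assumes one email per .txt file and no newlines
# in file names. -maxdepth is a GNU/BSD find extension.
count_emails() {
  find "$1" -maxdepth 1 -type f -name '*.txt' | wc -l | tr -d ' '
}

# Example usage (paths assume the unpacked folder is ./enron1):
#   n_spam=$(count_emails enron1/spam)
#   n_ham=$(count_emails enron1/ham)
#   echo "spam: $n_spam  ham: $n_ham"
```

Using find rather than a shell glob sidesteps argument-length limits on very large folders, and the trailing tr strips the padding that some wc implementations add.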
Let's look at some basic statistics on a random Enron employee's email.
We can count 1,500 spam versus 3,672 ham, so we already know
$p(\text{spam})$ and $p(\text{ham})$. Using command-line tools, we can also count
the spam emails that contain the word “meeting” (grep's -l flag lists each
matching file once, and -i ignores case):
grep -il meeting enron1/spam/*.txt | wc -l
This gives 16. Do the same for his ham folder, and we get 153. We can
now compute the chance that an email is spam knowing only that it
contains the word “meeting”:
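$$p(\text{spam}) = \frac{1500}{1500 + 3672} \approx 0.29$$

$$p(\text{ham}) \approx 0.71$$

$$p(\text{meeting} \mid \text{spam}) = \frac{16}{1500} \approx 0.0106$$

$$p(\text{meeting} \mid \text{ham}) = \frac{153}{3672} \approx 0.0416$$

$$p(\text{spam} \mid \text{meeting}) = \frac{p(\text{meeting} \mid \text{spam})\, p(\text{spam})}{p(\text{meeting})} = \frac{0.0106 \times 0.29}{0.0106 \times 0.29 + 0.0416 \times 0.71} \approx 0.09 = 9\%$$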
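The same arithmetic can be scripted with nothing more than awk. This is a sketch under our own naming: the function p_spam_given_word and its argument order are illustrative. It takes the four counts used above: total spam, total ham, spam emails containing the word, and ham emails containing the word.

```shell
# p_spam_given_word N_SPAM N_HAM WORD_IN_SPAM WORD_IN_HAM
# Prints p(spam | word) via Bayes' law, rounded to two decimals.
# Hypothetical helper; assumes both classes are non-empty and the word
# occurs in at least one email, so no denominator is zero.
p_spam_given_word() {
  awk -v ns="$1" -v nh="$2" -v ws="$3" -v wh="$4" 'BEGIN {
    p_spam   = ns / (ns + nh)   # prior p(spam)
    p_ham    = nh / (ns + nh)   # prior p(ham)
    p_w_spam = ws / ns          # likelihood p(word | spam)
    p_w_ham  = wh / nh          # likelihood p(word | ham)
    p_word   = p_w_spam * p_spam + p_w_ham * p_ham   # p(word)
    printf "%.2f\n", p_w_spam * p_spam / p_word
  }'
}

p_spam_given_word 1500 3672 16 153   # counts for "meeting"; prints 0.09
```

To feed it live counts instead of typed-in numbers, each argument can be a command substitution such as `$(grep -il meeting enron1/spam/*.txt | wc -l)`.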
Take note that we didn't need a fancy programming environment to
get this done.