Spam Filters, Naive Bayes, and Wrangling - Doing Data Science - page 100


$$p(\text{word}) = p(\text{word} \mid \text{spam})\, p(\text{spam}) + p(\text{word} \mid \text{ham})\, p(\text{ham})$$

In other words, we've boiled it down to a counting exercise: $p(\text{spam})$ counts spam emails versus all emails, $p(\text{word} \mid \text{spam})$ counts the prevalence of those spam emails that contain “word,” and $p(\text{word} \mid \text{ham})$ counts the prevalence of the ham emails that contain “word.”
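Written out as counts, the three quantities look as follows. The symbols $N_{\text{spam}}$ and $N_{\text{ham}}$ (number of emails in each class) and $n_{\text{spam}}(\text{word})$ and $n_{\text{ham}}(\text{word})$ (number of emails in each class that contain the word) are notation assumed here for illustration; they do not appear in the original text.

```latex
\begin{align*}
p(\text{spam}) &= \frac{N_{\text{spam}}}{N_{\text{spam}} + N_{\text{ham}}}, \\
p(\text{word} \mid \text{spam}) &= \frac{n_{\text{spam}}(\text{word})}{N_{\text{spam}}}, \\
p(\text{word} \mid \text{ham}) &= \frac{n_{\text{ham}}(\text{word})}{N_{\text{ham}}}.
\end{align*}
```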

To do this yourself, go online and download the Enron emails. Let's build a spam filter on that dataset. This really means we're building a new spam filter on top of the spam filter that existed for the employees of Enron. We'll use their definition of spam to train our spam filter. (This does mean that if the spammers have learned anything since 2001, we're out of luck.)

We could write a quick-and-dirty shell script in bash that does this, which Jake did. It downloads and unzips the file and creates a folder; each text file is an email; spam and ham go in separate folders.
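With that layout, the size of each class is just a file count. The following is a minimal sketch, assuming the archive unpacks to `enron1/spam` and `enron1/ham` with one `.txt` file per email (the path that appears in the grep command in this section). The helper name `count_emails` is hypothetical, and this is not the script Jake wrote.

```shell
#!/usr/bin/env bash

# count_emails DIR: print the number of .txt files directly inside DIR.
# (Hypothetical helper; assumes one email per .txt file.)
count_emails() {
    find "$1" -maxdepth 1 -type f -name '*.txt' | wc -l | tr -d ' '
}

base="enron1"   # assumed location of the unpacked dataset

# Only report counts if the assumed folders actually exist.
if [ -d "$base/spam" ] && [ -d "$base/ham" ]; then
    echo "spam: $(count_emails "$base/spam")"
    echo "ham:  $(count_emails "$base/ham")"
fi
```

The `tr -d ' '` step strips the padding that some `wc` implementations add, so the result can be fed straight into later arithmetic.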

Let's look at some basic statistics on a random Enron employee's email. We can count 1,500 spam versus 3,672 ham, so we already know $p(\text{spam})$ and $p(\text{ham})$. Using command-line tools, we can also count the number of emails in the spam folder that contain the word “meeting”:

grep -il meeting enron1/spam/*.txt | wc -l

This gives 16. Do the same for the ham folder, and we get 153. We can now compute the chance that an email is spam, knowing only that it contains the word “meeting”:

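$$
\begin{aligned}
p(\text{spam}) &= \frac{1500}{1500 + 3672} = 0.29 \\
p(\text{ham}) &= 0.71 \\
p(\text{meeting} \mid \text{spam}) &= \frac{16}{1500} = 0.0106 \\
p(\text{meeting} \mid \text{ham}) &= \frac{153}{3672} = 0.0416 \\
p(\text{spam} \mid \text{meeting}) &= \frac{p(\text{meeting} \mid \text{spam})\, p(\text{spam})}{p(\text{meeting})} \\
&= \frac{0.0106 \times 0.29}{0.0106 \times 0.29 + 0.0416 \times 0.71} = 0.09 = 9\%
\end{aligned}
$$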

Take note that we didn't need a fancy programming environment to get this done.
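In the same spirit, the whole calculation fits in a few lines of shell. This is a sketch rather than the authors' script: the function name `posterior` is hypothetical, `awk` is used only because bash has no floating-point arithmetic, and all four counts are assumed to be nonzero.

```shell
#!/usr/bin/env bash

# posterior N_WORD_SPAM N_SPAM N_WORD_HAM N_HAM
# Prints p(spam | word) to four decimals, computed from raw counts
# with Bayes' rule. (Hypothetical helper; assumes nonzero counts.)
posterior() {
    awk -v ws="$1" -v s="$2" -v wh="$3" -v h="$4" 'BEGIN {
        p_spam   = s / (s + h)      # prior p(spam)
        p_ham    = h / (s + h)      # prior p(ham)
        p_w_spam = ws / s           # p(word | spam)
        p_w_ham  = wh / h           # p(word | ham)
        p_w = p_w_spam * p_spam + p_w_ham * p_ham   # p(word)
        printf "%.4f\n", p_w_spam * p_spam / p_w
    }'
}

# Counts for "meeting" from the text: 16 of 1,500 spam, 153 of 3,672 ham.
posterior 16 1500 153 3672   # prints 0.0947
```

Without intermediate rounding the result is about 0.0947 rather than 0.09; both round to the 9% quoted above. The email totals cancel out, so the posterior reduces to 16 / (16 + 153).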
