Next, we can try:
• “money”: 80% chance of being spam
• “viagra”: 100% chance
• “enron”: 0% chance
This illustrates that the model, as it stands, is overfitting; we are getting
overconfident because of biased data. Is it really a slam-dunk that any
email containing the word “Viagra” is spam? It's of course possible to
write a nonspam email with the word “Viagra,” as well as a spam email
with the word “Enron.”
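To make that overconfidence concrete, here is a minimal sketch in Python with made-up counts (only the 80%, 100%, and 0% figures come from the text): estimating p(spam | word) by raw counting gives exactly 0% or 100% for any word that happens to appear in only one class.

def p_spam_given_word(spam_count, ham_count):
    # Fraction of the emails containing this word that are spam.
    total = spam_count + ham_count
    return spam_count / total if total else 0.0

# Hypothetical counts of emails containing each word:
counts = {
    "money":  {"spam": 80, "ham": 20},   # 80% spam
    "viagra": {"spam": 25, "ham": 0},    # never seen in ham -> 100%
    "enron":  {"spam": 0,  "ham": 150},  # never seen in spam -> 0%
}

for word, c in counts.items():
    print(word, p_spam_given_word(c["spam"], c["ham"]))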
A Spam Filter That Combines Words: Naive Bayes
Next, let's do it for all the words. Each email can be represented by a
binary vector, whose j th entry is 1 or 0 depending on whether the j th
word appears. Note this is a huge-ass vector, considering how many
words we have, and we'd probably want to represent it with the indices
of the words that actually show up.
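As a minimal sketch of that sparse representation (the tiny vocabulary and whitespace tokenizer are illustrative, not from the text), an email can be stored as just the indices of the words it contains and expanded back into the full binary vector x when needed:

vocabulary = {"money": 0, "viagra": 1, "enron": 2, "meeting": 3}

def email_to_indices(text):
    # Indices of vocabulary words that appear in the email.
    words = set(text.lower().split())
    return sorted(vocabulary[w] for w in words if w in vocabulary)

def indices_to_dense(indices, vocab_size):
    # Expand the sparse index list back into the full binary vector x.
    x = [0] * vocab_size
    for j in indices:
        x[j] = 1
    return x

indices = email_to_indices("Meeting about money tomorrow")
print(indices)                                      # [0, 3]
print(indices_to_dense(indices, len(vocabulary)))   # [1, 0, 0, 1]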
The model's output is the probability that we'd see a given word vector
given that we know it's spam (or that it's ham). Denote the email vector
to be x and the various entries x_j, where j indexes the words. For
now we can denote "is spam" by c, and we have the following model
for p(x | c), i.e., the probability that the email's vector looks like this
considering it's spam:

$$p(x \mid c) = \prod_j \theta_{jc}^{x_j} \, (1 - \theta_{jc})^{1 - x_j}$$
The θ_jc here is the probability that an individual word j is present in a
spam email. We saw how to compute that in the previous section via
counting, and so we can assume we've already computed it, separately
and in parallel, for every word.
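As a minimal sketch, with a made-up three-email spam training set, the counting estimate of θ_jc for the spam class is just the fraction of spam emails containing each word:

# Each spam email represented as the set of words it contains (made up):
spam_emails = [
    {"money", "viagra"},
    {"money"},
    {"viagra", "offer"},
]
vocabulary = ["money", "viagra", "enron", "offer"]

# theta_spam[word]: fraction of spam emails in which the word appears.
theta_spam = {
    word: sum(word in email for email in spam_emails) / len(spam_emails)
    for word in vocabulary
}
print(theta_spam)   # roughly {'money': 0.67, 'viagra': 0.67, 'enron': 0.0, 'offer': 0.33}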
We are modeling the words independently (also known as "independent
trials"), which is why we take the product on the righthand side of
the preceding formula and don't count how many times each word is
present. This is called "naive" because we know that certain words
actually tend to appear together, and we're ignoring this.
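Here is a minimal sketch of evaluating the likelihood formula above for one class, given its θ_j values and a binary email vector x. Summing logs rather than taking the raw product is our own choice to avoid underflow with a large vocabulary; note also that a θ_j of exactly 0 or 1, like the "enron" and "viagra" estimates above, would break the computation for some emails, which is another symptom of the overconfidence problem.

import math

def log_likelihood(x, theta):
    # log p(x | c) under the naive (word-independence) model:
    # add log(theta_j) if word j is present, log(1 - theta_j) if absent.
    total = 0.0
    for x_j, theta_j in zip(x, theta):
        p = theta_j if x_j == 1 else (1.0 - theta_j)
        total += math.log(p)
    return total

theta_spam = [0.67, 0.67, 0.0, 0.33]   # illustrative theta_j values for spam
x = [1, 0, 0, 1]                       # email contains "money" and "offer"
print(log_likelihood(x, theta_spam))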