Recall from your basic statistics course that, given events x and y, there's a relationship between the probabilities of either event (denoted p(x) and p(y)), the joint probability (both happen, denoted p(x, y)), and conditional probabilities (event x happens given y happens, denoted p(x|y)) as follows:

p(y|x) p(x) = p(x, y) = p(x|y) p(y)

Using that, we solve for p(y|x) (assuming p(x) ≠ 0) to get what is called Bayes' Law:

p(y|x) = p(x|y) p(y) / p(x)
The denominator term, p(x), is often implicitly computed and can thus be treated as a "normalization constant." In our current situation, set y to refer to the event "I am sick," or "sick" for shorthand; and set x to refer to the event "the test is positive," or "+" for shorthand. Then we actually know, or at least can compute, every term:
p(sick|+) = p(+|sick) p(sick) / p(+)
          = (0.99 · 0.01) / (0.99 · 0.01 + 0.01 · 0.99)
          = 0.50 = 50%
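The arithmetic above can be checked with a few lines of Python; the variable names are ours, and the numbers (99% test accuracy, 1% prevalence) come from the worked example:

```python
p_sick = 0.01               # prior: p(sick)
p_pos_given_sick = 0.99     # p(+ | sick), the test's accuracy
p_pos_given_healthy = 0.01  # p(+ | not sick), the false-positive rate

# Denominator p(+): sum over both ways a test can come back positive.
p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)

# Bayes' Law: p(sick | +) = p(+ | sick) p(sick) / p(+)
p_sick_given_pos = p_pos_given_sick * p_sick / p_pos
print(p_sick_given_pos)  # 0.5
```

Even with a 99% accurate test, a positive result only makes sickness a coin flip, because the disease is rare to begin with.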
A Spam Filter for Individual Words
So how do we use Bayes' Law to create a good spam filter? Think about
it this way: if the word “Viagra” appears, this adds to the probability
that the email is spam. But it's not conclusive, yet. We need to see what
else is in the email.
Let's first focus on just one word at a time, which we generically call
“word.” Then, applying Bayes' Law, we have:
p(spam|word) = p(word|spam) p(spam) / p(word)
The righthand side of this equation is computable using enough pre-labeled data. If we refer to nonspam as "ham," then we only need to compute p(word|spam), p(word|ham), p(spam), and p(ham) = 1 − p(spam), because we can work out the denominator using the formula we used earlier in our medical test example, namely:

p(word) = p(word|spam) p(spam) + p(word|ham) p(ham)
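A minimal sketch of this single-word estimate, assuming a toy pre-labeled corpus; the tiny dataset and the helper name `p_spam_given_word` are ours, for illustration only:

```python
def p_spam_given_word(word, spam_emails, ham_emails):
    """Estimate p(spam | word) from labeled emails via Bayes' Law."""
    n_spam, n_ham = len(spam_emails), len(ham_emails)
    p_spam = n_spam / (n_spam + n_ham)     # prior p(spam)
    p_ham = 1 - p_spam                     # p(ham) = 1 - p(spam)

    # Fraction of spam (resp. ham) emails containing the word.
    p_word_spam = sum(word in e for e in spam_emails) / n_spam
    p_word_ham = sum(word in e for e in ham_emails) / n_ham

    # Denominator p(word), expanded as in the medical test example.
    p_word = p_word_spam * p_spam + p_word_ham * p_ham
    return p_word_spam * p_spam / p_word

# Toy corpus: each email is just its set of words.
spam = [{"viagra", "buy", "now"}, {"viagra", "cheap"}]
ham = [{"meeting", "tomorrow"}, {"viagra", "prescription"}]
print(p_spam_given_word("viagra", spam, ham))  # 0.666...
```

With these made-up counts, "viagra" appears in all spam but only half the ham, so the word pushes the spam probability up from the 50% prior to about 67%; as the text notes, one word alone is suggestive but not conclusive.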