Recall from your basic statistics course that, given events x and y, there's a relationship between the probabilities of either event (denoted p(x) and p(y)), the joint probability (both happen, denoted p(x, y)), and conditional probabilities (event x happens given y happens, denoted p(x|y)) as follows:

p(y|x) p(x) = p(x, y) = p(x|y) p(y)

Using that, we solve for p(y|x) (assuming p(x) ≠ 0) to get what is called Bayes' Law:

p(y|x) = p(x|y) p(y) / p(x)
The denominator term, p(x), is often implicitly computed and can thus be treated as a "normalization constant." In our current situation, set y to refer to the event "I am sick," or "sick" for shorthand; and set x to refer to the event "the test is positive," or "+" for shorthand. Then we actually know, or at least can compute, every term:
p(sick|+) = p(+|sick) p(sick) / p(+)
          = (0.99 · 0.01) / (0.99 · 0.01 + 0.01 · 0.99)
          = 0.50 = 50%
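The arithmetic above can be checked with a few lines of Python; the variable names are ours, and the numbers (99% test accuracy, 1% prevalence) come from the worked example:

```python
p_sick = 0.01               # prior: p(sick)
p_pos_given_sick = 0.99     # p(+ | sick), the test's accuracy
p_pos_given_healthy = 0.01  # p(+ | not sick), the false-positive rate

# Denominator p(+): sum over both ways a test can come back positive.
p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)

# Bayes' Law: p(sick | +) = p(+ | sick) p(sick) / p(+)
p_sick_given_pos = p_pos_given_sick * p_sick / p_pos
print(p_sick_given_pos)  # 0.5
```

Even with a 99% accurate test, a positive result only makes sickness a coin flip, because the disease is rare to begin with.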
A Spam Filter for Individual Words
So how do we use Bayes' Law to create a good spam filter? Think about
it this way: if the word “Viagra” appears, this adds to the probability
that the email is spam. But it's not conclusive, yet. We need to see what
else is in the email.
Let's first focus on just one word at a time, which we generically call
“word.” Then, applying Bayes' Law, we have:
p(spam|word) = p(word|spam) p(spam) / p(word)
The righthand side of this equation is computable using enough pre-labeled data. If we refer to nonspam as "ham," then we only need to compute p(word|spam), p(word|ham), p(spam), and p(ham) = 1 − p(spam), because we can work out the denominator using the formula we used earlier in our medical test example, namely:

p(word) = p(word|spam) p(spam) + p(word|ham) p(ham)
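A minimal sketch of this single-word estimate, assuming a toy pre-labeled corpus; the tiny dataset and the helper name `p_spam_given_word` are ours, for illustration only:

```python
def p_spam_given_word(word, spam_emails, ham_emails):
    """Estimate p(spam | word) from labeled emails via Bayes' Law."""
    n_spam, n_ham = len(spam_emails), len(ham_emails)
    p_spam = n_spam / (n_spam + n_ham)     # prior p(spam)
    p_ham = 1 - p_spam                     # p(ham) = 1 - p(spam)

    # Fraction of spam (resp. ham) emails containing the word.
    p_word_spam = sum(word in e for e in spam_emails) / n_spam
    p_word_ham = sum(word in e for e in ham_emails) / n_ham

    # Denominator p(word), expanded as in the medical test example.
    p_word = p_word_spam * p_spam + p_word_ham * p_ham
    return p_word_spam * p_spam / p_word

# Toy corpus: each email is just its set of words.
spam = [{"viagra", "buy", "now"}, {"viagra", "cheap"}]
ham = [{"meeting", "tomorrow"}, {"viagra", "prescription"}]
print(p_spam_given_word("viagra", spam, ham))  # 0.666...
```

With these made-up counts, "viagra" appears in all spam but only half the ham, so the word pushes the spam probability up from the 50% prior to about 67%; as the text notes, one word alone is suggestive but not conclusive.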