• Hand and Yu (2001), “Idiot's Bayes - not so stupid after all?” (The whole paper is about why it doesn't suck, which is related to redundancies in language.)
Fancy It Up: Laplace Smoothing
Remember the $\theta_j$ from the previous section? That referred to the probability of seeing a given word (indexed by $j$) in a spam email. If you think about it, this is just a ratio of counts:

$$\theta_j = \frac{n_{jc}}{n_c},$$

where $n_{jc}$ denotes the number of spam emails containing that word and $n_c$ denotes the total number of spam emails.
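To make the counting concrete, here is a minimal sketch in Python (the toy corpus is entirely made up for illustration; nothing here is from the book):

```python
# Toy corpus of (email text, is_spam) pairs -- made up for illustration.
emails = [
    ("cheap viagra now", True),
    ("viagra discount meds", True),
    ("meeting at noon", False),
]

word = "viagra"
spam = [text for text, is_spam in emails if is_spam]
n_jc = sum(word in text.split() for text in spam)  # spam emails containing the word
n_c = len(spam)                                    # total number of spam emails

print(n_jc / n_c)  # 2/2 = 1.0: the raw estimate of theta_j
```

Note that the raw estimate here is exactly 1, which is precisely the degenerate behavior the next idea is designed to fix.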
Laplace Smoothing refers to the idea of replacing our straight-up estimate of $\theta_j$ with something a bit fancier:

$$\theta_{jc} = \frac{n_{jc} + \alpha}{n_c + \beta}$$
We might fix $\alpha = 1$ and $\beta = 10$, for example, to prevent the possibility of getting 0 or 1 for a probability, which we saw earlier happening with “viagra.” (A quick numerical sketch of the effect follows.)
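Here is a minimal sketch of the numbers involved (not from the book; the figure of 25 spam emails is a made-up example):

```python
def smoothed_theta(n_jc, n_c, alpha=1, beta=10):
    """Laplace-smoothed estimate (n_jc + alpha) / (n_c + beta)."""
    return (n_jc + alpha) / (n_c + beta)

# Suppose "viagra" appeared in every one of 25 spam emails: the raw
# estimate n_jc / n_c is exactly 1, the degenerate case noted earlier.
n_jc, n_c = 25, 25
print(n_jc / n_c)                 # 1.0
print(smoothed_theta(n_jc, n_c))  # 26/35, roughly 0.74: safely below 1

# And a word never seen in spam no longer gets probability 0.
print(smoothed_theta(0, n_c))     # 1/35, roughly 0.03
```

With $\alpha = 1$ and $\beta = 10$, no estimate can reach 0 or 1: the numerator is always at least 1, and since $n_{jc} \leq n_c$ it is always strictly less than the denominator.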
Does this seem totally ad hoc? Well, if we want to get fancy, we can see this as equivalent to having a prior and performing a maximum likelihood estimate. Let's get fancy! If we denote by $ML$ the maximum likelihood estimate, and by $D$ the dataset, then we have:
$$\theta_{ML} = \arg\max_{\theta} \, p(D \mid \theta)$$
In other words, the vector of values $\theta_j = n_{jc}/n_c$ is the answer to the question: for what value of $\theta$ were the data $D$ most probable? If we assume independent trials again, as we did in our first attempt at Naive Bayes, then we want to choose the $\theta_j$ to separately maximize the following quantity for each $j$:
$$\log\left(\theta_j^{\,n_{jc}} \left(1 - \theta_j\right)^{\,n_c - n_{jc}}\right)$$
If we take the derivative and set it to zero, we get:
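$$\frac{d}{d\theta_j}\Big[\,n_{jc}\log\theta_j + \left(n_c - n_{jc}\right)\log\left(1-\theta_j\right)\Big] = \frac{n_{jc}}{\theta_j} - \frac{n_c - n_{jc}}{1-\theta_j} = 0 \quad\Longrightarrow\quad \theta_j = \frac{n_{jc}}{n_c}$$

so the ratio of counts really is the maximum likelihood estimate. You can also sanity-check this numerically; below is a minimal sketch with made-up counts:

```python
import numpy as np

# Hypothetical counts: word j appeared in 5 of 20 spam emails.
n_jc, n_c = 5, 20

# The quantity from the text: n_jc*log(theta) + (n_c - n_jc)*log(1 - theta).
theta = np.linspace(0.001, 0.999, 999)
loglik = n_jc * np.log(theta) + (n_c - n_jc) * np.log(1 - theta)

print(theta[np.argmax(loglik)])  # ~0.25, i.e., n_jc / n_c
```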