Spam Filters, Naive Bayes, and Wrangling - Doing Data Science

Naive Bayes

So are we at a loss now that two methods we're familiar with, linear

regression and k-NN, won't work for the spam filter problem? No!

Naive Bayes is another classification method at our disposal that scales

well and has nice intuitive appeal.

Bayes Law

Let's start with an even simpler example than the spam filter to get a

feel for how Naive Bayes works. Let's say we're testing for a rare disease,

where 1% of the population is infected. We have a highly sensitive and

specific test, which is not quite perfect:

• 99% of sick patients test positive.

• 99% of healthy patients test negative.

Given that a patient tests positive, what is the probability that the

patient is actually sick?

A naive approach to answering this question is this: Imagine we have

100 × 100 = 10,000 perfectly representative people. That would mean

that 100 are sick, and 9,900 are healthy. Moreover, after giving all of

them the test we'd get 99 sick people testing positive, but 99 healthy

people (1% of the 9,900) testing positive as well. If you test positive, in

other words, you're equally likely to be healthy or sick; the answer is

50%. A tree diagram of this approach is shown in Figure 4-3.

Figure 4-3. Tree diagram to build intuition
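
The head count above can be checked with a few lines of Python. This is only a sketch of the counting argument: the function name and parameter names (population, prevalence, sensitivity, specificity) are illustrative choices, not something the text defines, and exact fractions are used so the arithmetic doesn't pick up floating-point noise.

```python
from fractions import Fraction


def positive_test_breakdown(population, prevalence, sensitivity, specificity):
    """Count who tests positive in a perfectly representative population.

    Hypothetical helper for illustration; returns
    (sick people testing positive, healthy people testing positive,
     share of positive tests that belong to sick people).
    """
    sick = population * prevalence
    healthy = population - sick
    true_positives = sick * sensitivity              # sick and test positive
    false_positives = healthy * (1 - specificity)    # healthy but test positive
    p_sick_given_positive = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, p_sick_given_positive


# Numbers from the example: 10,000 people, 1% infected,
# 99% of sick test positive, 99% of healthy test negative.
tp, fp, p = positive_test_breakdown(
    population=10_000,
    prevalence=Fraction(1, 100),
    sensitivity=Fraction(99, 100),
    specificity=Fraction(99, 100),
)
print(tp, fp, p)  # 99 99 1/2
```

Changing the prevalence argument shows how strongly the answer depends on how rare the disease is, which is the point of the exercise.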

Let's do it again using fancy notation so we'll feel smart.
