Spam Filters, Naive Bayes, and Wrangling - Doing Data Science


Figure 4-1. Suspiciously spammy

Rachel's class had a few ideas about what might be clear signs of spam:

• Any email is spam if it contains Viagra references. That's a good rule to start with, but as you've likely seen in your own email, people figured out this spam filter rule and got around it by modifying the spelling. (It's sad that spammers are so smart and aren't working on more important projects than selling lots of Viagra…)

• Maybe something about the length of the subject gives it away as spam, or perhaps excessive use of exclamation points or other punctuation. But some words like “Yahoo!” are authentic, so you don't want to make your rule too simplistic.
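
To see why such hand-written rules are fragile, here is a minimal sketch of the two ideas above. The function names, the example strings, and the threshold of one exclamation point are assumptions made for illustration; they are not rules from the class.

```python
def mentions_viagra(text: str) -> bool:
    """Hypothetical rule 1: flag any email that contains the word 'viagra'."""
    return "viagra" in text.lower()


def has_exclamation(subject: str, limit: int = 1) -> bool:
    """Hypothetical rule 2: flag subjects with at least `limit` exclamation points."""
    return subject.count("!") >= limit


# Rule 1 catches the plain spelling...
caught = mentions_viagra("Cheap Viagra, order today")
# ...but a trivial respelling slips through (a false negative).
evaded = mentions_viagra("Cheap V1agra, order today")
# Rule 2 flags a legitimate subject line (a false positive).
false_alarm = has_exclamation("Your Yahoo! account statement")

print(caught, evaded, false_alarm)
```

Each rule, taken alone, is either easy to dodge or too eager, which is the weakness the two bullets describe.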

And here are a few suggestions regarding code you could write to identify spam:

• Try a probabilistic model. In other words, instead of relying on a few simple rules, should you have many rules of thumb that aggregate to give the probability that a given email is spam? This is a great idea.

• What about k-nearest neighbors or linear regression? You learned about these techniques in the previous chapter, but do they apply to this kind of problem? (Hint: the answer is “No.”)
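
The probabilistic suggestion can be sketched roughly as follows. This is one possible way to aggregate rules of thumb, not the method the chapter goes on to develop in detail: it treats the rules as independent given the class and combines them with Bayes' rule. The rule names, the per-rule probabilities, and the 50/50 prior are all invented for illustration; in practice they would be estimated from labeled email.

```python
import math

# Hypothetical rules of thumb. Each entry is
# (predicate, P(rule fires | spam), P(rule fires | not spam)).
# The probabilities are made-up placeholders, not estimates from real data.
RULES = {
    "mentions_viagra": (lambda e: "viagra" in e.lower(), 0.30, 0.001),
    "many_exclamations": (lambda e: e.count("!") >= 3, 0.40, 0.05),
    "shouted_word": (
        lambda e: any(w.isupper() and len(w) > 3 for w in e.split()),
        0.50,
        0.10,
    ),
}


def spam_probability(email: str, prior_spam: float = 0.5) -> float:
    """Combine all rules into P(spam | email), assuming the rules are
    conditionally independent given the class (an assumption of this sketch)."""
    log_spam = math.log(prior_spam)
    log_ham = math.log(1.0 - prior_spam)
    for predicate, p_given_spam, p_given_ham in RULES.values():
        if predicate(email):
            log_spam += math.log(p_given_spam)
            log_ham += math.log(p_given_ham)
        else:
            log_spam += math.log(1.0 - p_given_spam)
            log_ham += math.log(1.0 - p_given_ham)
    # Normalize in log space to avoid underflow when there are many rules.
    top = max(log_spam, log_ham)
    spam_weight = math.exp(log_spam - top)
    ham_weight = math.exp(log_ham - top)
    return spam_weight / (spam_weight + ham_weight)


print(spam_probability("Buy VIAGRA now!!!"))
print(spam_probability("Lunch at noon?"))
```

No single rule decides the outcome here; each one nudges the probability up or down, so a legitimate email that trips one rule is not automatically condemned.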

In this chapter, we'll use Naive Bayes to solve this problem, which is in some sense in between the two. But first…
