Figure 4-1. Suspiciously spammy
Rachel's class had a few ideas about what things might be clear signs
of spam:
• Any email is spam if it contains Viagra references. That's a good
rule to start with, but as you've likely seen in your own email,
people figured out this spam filter rule and got around it by
modifying the spelling. (It's sad that spammers are so smart and aren't
working on more important projects than selling lots of Viagra…)
• Maybe something about the length of the subject gives it away as
spam, or perhaps excessive use of exclamation points or other
punctuation. But some words like “Yahoo!” are authentic, so you
don't want to make your rule too simplistic.
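
To make the weakness of such hand-written rules concrete, here is a minimal sketch in Python. The function names and the exclamation-point threshold are hypothetical choices made for illustration; they do not come from Rachel's class.

```python
def keyword_rule(text: str) -> bool:
    """Flag the email if it literally contains 'viagra' (case-insensitive)."""
    return "viagra" in text.lower()


def punctuation_rule(subject: str, max_exclamations: int = 2) -> bool:
    """Flag the subject if it has more than `max_exclamations` '!' characters.

    The threshold of 2 is an arbitrary, illustrative assumption.
    """
    return subject.count("!") > max_exclamations


# The keyword rule catches the plain spelling...
print(keyword_rule("Cheap Viagra, buy now"))   # True
# ...but a trivially modified spelling slips past it.
print(keyword_rule("Cheap V1agra, buy now"))   # False

# A single '!' in a legitimate name is fine; a run of them is flagged.
print(punctuation_rule("Your Yahoo! account summary"))  # False
print(punctuation_rule("WIN BIG NOW!!!"))               # True
```

Each rule is easy to write and just as easy to evade or to trigger by accident, which is why a handful of fixed rules tends not to hold up.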
And here are a few suggestions regarding code you could write to
identify spam:
• Try a probabilistic model. In other words, rather than relying on a
few simple rules, should you have many rules of thumb that together
give the probability that a given email is spam? This is a
great idea.
• What about k-nearest neighbors or linear regression? You learned
about these techniques in the previous chapter, but do they apply
to this kind of problem? (Hint: the answer is “No.”)
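
As a rough illustration of the first suggestion, the sketch below combines several rules of thumb into a single spam probability. It assumes the rules fire independently of one another, which is a simplification, and every number in it is invented for illustration rather than estimated from real email. The rule names and the `spam_probability` function are hypothetical.

```python
import math
from typing import Iterable

# Hypothetical rules: name -> (P(rule fires | spam), P(rule fires | not spam)).
# These probabilities are made up for illustration only.
RULES = {
    "mentions_viagra": (0.30, 0.001),
    "many_exclamations": (0.40, 0.05),
    "long_subject": (0.50, 0.30),
}


def spam_probability(fired: Iterable[str], prior_spam: float = 0.5) -> float:
    """Combine rule evidence into P(spam), assuming the rules are independent."""
    fired = set(fired)
    # Start from the prior odds, then add each rule's log-likelihood ratio.
    log_odds = math.log(prior_spam / (1 - prior_spam))
    for name, (p_spam, p_ham) in RULES.items():
        if name in fired:
            log_odds += math.log(p_spam / p_ham)
        else:
            log_odds += math.log((1 - p_spam) / (1 - p_ham))
    # Convert log-odds back to a probability.
    return 1 / (1 + math.exp(-log_odds))


print(spam_probability([]))                      # no rule fired: below 0.5
print(spam_probability(["mentions_viagra"]))     # strong evidence: close to 1
print(spam_probability(["long_subject"]))        # weak evidence: still below 0.5
```

No single rule decides the outcome here. Each one nudges the probability up or down by an amount that reflects how much more often it fires on spam than on legitimate mail.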
In this chapter, we'll solve this problem with Naive Bayes, which is
in some sense in between the two. But first…
 