Gutierrez: How is spam typically combated?
Foreman: Over a decade ago, Paul Graham wrote an essay called “A Plan
for Spam.” Since then, there's been a whole lot of work done around spam
detection. If you go back and read those early essays, the way it was dealt with
was looking at the actual content of the email. People looked at the words on
the page and put them through a model to get a sense of whether the words
in the email were about something bad or something spammy. So of course,
this started an escalation between spammers and people trying to stop them.
All of a sudden Viagra has an @ symbol in it, the numeral one instead of an “I”,
or similar things like that. The spammers were trying to get around matching
tokens, naive Bayes filters, and similar spam detection techniques. This kept
escalating and escalating. Then, rather than using words, spammers started
using images with words, which meant that the people trying to stop spammers
had to start doing OCR [Optical Character Recognition] on images to get
out the tokens to put them through their models in order to identify spam.
This continued until it got to the point where there needed to be a new
approach. It became very clear to us that it's not good enough to just look at
the content spammers are sending. We're an international company, so the
content we are analyzing could be in many, many languages. It could also be
all images. Or, most worrying and perhaps the biggest problem, the content
could look perfectly fine. And this is something I don't think people
realize about spam. The modern assumption that classical spam filters operate
under is that spam embodies a platonic ideal: that spam is about
Nigerian princes, Viagra, or similar sorts of things.
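The content-based approach described in this answer can be illustrated with a minimal sketch of a token-level naive Bayes filter. This is not MailChimp's method or the exact algorithm from “A Plan for Spam”; the function names (`tokenize`, `train`, `spam_log_odds`) and the four-message training set are hypothetical and exist only to show why respelled tokens slip past such a filter.

```python
import math
from collections import Counter

# Toy training data, invented purely for illustration.
TRAINING = [
    ("buy cheap viagra now", "spam"),
    ("cheap viagra online", "spam"),
    ("meeting notes attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
]

def tokenize(text):
    # Lowercase and split on whitespace; real filters use richer tokenizers.
    return text.lower().split()

def train(messages):
    # Count token occurrences and message totals per label ("spam" or "ham").
    token_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in messages:
        doc_counts[label] += 1
        token_counts[label].update(tokenize(text))
    return token_counts, doc_counts

def spam_log_odds(text, token_counts, doc_counts):
    # Log-odds that the message is spam; positive means "more likely spam".
    vocab = set(token_counts["spam"]) | set(token_counts["ham"])
    totals = {label: sum(c.values()) for label, c in token_counts.items()}
    score = math.log(doc_counts["spam"] / doc_counts["ham"])  # class prior
    for tok in tokenize(text):
        if tok not in vocab:
            continue  # a token never seen in training carries no evidence
        # Laplace (add-one) smoothing avoids zero probabilities.
        p_spam = (token_counts["spam"][tok] + 1) / (totals["spam"] + len(vocab))
        p_ham = (token_counts["ham"][tok] + 1) / (totals["ham"] + len(vocab))
        score += math.log(p_spam / p_ham)
    return score

if __name__ == "__main__":
    counts, docs = train(TRAINING)
    for message in ("cheap viagra", "ch3ap v1agra", "lunch meeting"):
        print(message, round(spam_log_odds(message, counts, docs), 3))
```

In this sketch the respelled message “ch3ap v1agra” contains no token the model has seen, so its score falls back to the class prior and the filter has nothing to act on. That is the token-matching weakness the escalation above exploited.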
Gutierrez: How is MailChimp combatting spam?
Foreman: We think about spam using the postmodern definition that “spam
is in the eye of the reader.” We know this to be true, because email clients
have spam buttons. It's you—the person who receives the email—that deems
an email to be spam. It's not the words on the page. It's how you interpret those
words and your relationship with the sender. This is fascinating from a data
science perspective.
How we combat spam is by going after the relationship between the sender
and the receiver. The content of the email that is sent out could be very
innocuous. It could be local real estate ads, let's say. So in this example it
could be a real estate agent sending out ads for houses that are for sale in
a specific market. What we ask and try to figure out is how this real estate
agent procured the list of people they are contacting via email. Are these
people clients they've worked with in the past, and they told the agent, “Yeah,
I'd love to see your content in the future!”? Or did the agent go to a local
Chamber of Commerce meeting and take the email list? Did the agent do a
public information request to the state and get public university employee
email addresses—which we've seen people do? Did the agent scrape email
 