Next, we can try:
• “money”: 80% chance of being spam
• “viagra”: 100% chance
• “enron”: 0% chance
This illustrates that the model, as it stands, is overfitting; we are getting
overconfident because of biased data. Is it really a slam-dunk that any
email containing the word “Viagra” is spam? It's of course possible to
write a nonspam email with the word “Viagra,” as well as a spam email
with the word “Enron.”
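To make that overconfidence concrete, here is a minimal sketch in Python with made-up counts (only the 80%, 100%, and 0% figures come from the text): estimating p(spam | word) by raw counting gives exactly 0% or 100% for any word that happens to appear in only one class.

def p_spam_given_word(spam_count, ham_count):
    # Fraction of the emails containing this word that are spam.
    total = spam_count + ham_count
    return spam_count / total if total else 0.0

# Hypothetical counts of emails containing each word:
counts = {
    "money":  {"spam": 80, "ham": 20},   # 80% spam
    "viagra": {"spam": 25, "ham": 0},    # never seen in ham -> 100%
    "enron":  {"spam": 0,  "ham": 150},  # never seen in spam -> 0%
}

for word, c in counts.items():
    print(word, p_spam_given_word(c["spam"], c["ham"]))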
A Spam Filter That Combines Words: Naive Bayes
Next, let's do it for all the words. Each email can be represented by a
binary vector, whose j th entry is 1 or 0 depending on whether the j th
word appears. Note this is a huge-ass vector, considering how many
words we have, and we'd probably want to represent it with the indices
of the words that actually show up.
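As a minimal sketch of that sparse representation (the tiny vocabulary and whitespace tokenizer are illustrative, not from the text), an email can be stored as just the indices of the words it contains and expanded back into the full binary vector x when needed:

vocabulary = {"money": 0, "viagra": 1, "enron": 2, "meeting": 3}

def email_to_indices(text):
    # Indices of vocabulary words that appear in the email.
    words = set(text.lower().split())
    return sorted(vocabulary[w] for w in words if w in vocabulary)

def indices_to_dense(indices, vocab_size):
    # Expand the sparse index list back into the full binary vector x.
    x = [0] * vocab_size
    for j in indices:
        x[j] = 1
    return x

indices = email_to_indices("Meeting about money tomorrow")
print(indices)                                      # [0, 3]
print(indices_to_dense(indices, len(vocabulary)))   # [1, 0, 0, 1]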
The model's output is the probability that we'd see a given word vector
given that we know it's spam (or that it's ham). Denote the email vector
to be x and the various entries x_j, where j indexes the words. For
now we can denote "is spam" by c, and we have the following model
for p(x | c), i.e., the probability that the email's vector looks like this
considering it's spam:

$$p(x \mid c) = \prod_j \theta_{jc}^{x_j} \, (1 - \theta_{jc})^{1 - x_j}$$
The θ_jc here is the probability that an individual word j is present in a
spam email. We saw how to compute that in the previous section via
counting, and so we can assume we've already computed it, separately
and in parallel, for every word.
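As a minimal sketch, with a made-up three-email spam training set, the counting estimate of θ_jc for the spam class is just the fraction of spam emails containing each word:

# Each spam email represented as the set of words it contains (made up):
spam_emails = [
    {"money", "viagra"},
    {"money"},
    {"viagra", "offer"},
]
vocabulary = ["money", "viagra", "enron", "offer"]

# theta_spam[word]: fraction of spam emails in which the word appears.
theta_spam = {
    word: sum(word in email for email in spam_emails) / len(spam_emails)
    for word in vocabulary
}
print(theta_spam)   # roughly {'money': 0.67, 'viagra': 0.67, 'enron': 0.0, 'offer': 0.33}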
We are modeling the words independently (also known as "independent
trials"), which is why we take the product on the righthand side of
the preceding formula and don't count how many times each word is
present. This is called "naive" because we know that certain words
actually tend to appear together, and we're ignoring this.
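Here is a minimal sketch of evaluating the likelihood formula above for one class, given its θ_j values and a binary email vector x. Summing logs rather than taking the raw product is our own choice to avoid underflow with a large vocabulary; note also that a θ_j of exactly 0 or 1, like the "enron" and "viagra" estimates above, would break the computation for some emails, which is another symptom of the overconfidence problem.

import math

def log_likelihood(x, theta):
    # log p(x | c) under the naive (word-independence) model:
    # add log(theta_j) if word j is present, log(1 - theta_j) if absent.
    total = 0.0
    for x_j, theta_j in zip(x, theta):
        p = theta_j if x_j == 1 else (1.0 - theta_j)
        total += math.log(p)
    return total

theta_spam = [0.67, 0.67, 0.0, 0.33]   # illustrative theta_j values for spam
x = [1, 0, 0, 1]                       # email contains "money" and "offer"
print(log_likelihood(x, theta_spam))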