3. Finally, these two utility functions are relatively short and not strictly necessary, but they are good to have:
;; NaiveBayesTrainer comes from the cc.mallet.classify package.
(defn train [instance-list]
  (.train (NaiveBayesTrainer.) instance-list))
(defn classify [bayes instance-list]
  (.classify bayes instance-list))
Now, we can use these functions to load the training documents from the training
directory, train the classifier, and use it to classify the test files:
(def pipe (make-pipe-list))
(def instance-list (add-input-directory "training" pipe))
(def bayes (train instance-list))
Now, we can use it to classify the test files:
(def test-list (add-input-directory "test-data" pipe))
(def classes (classify bayes test-list))
Finding the results just takes digging into the class structure:
user=> (.. (first (seq classes)) getLabeling getBestLabel toString)
"hard_ham"
We can use this to construct a matrix that shows how the classifier performs, as follows:
                 Expected ham    Expected spam
Actually ham     246             99
Actually spam    4               402
From this confusion matrix, you can see that it does pretty well. Moreover, when it
errs, it errs on the side of misclassifying spam as ham. This is good because it means
that we'd only need to dig into our spam folder for four misfiled emails.
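As a sketch of how such a matrix could be tallied programmatically, we can compare each instance's expected label against the classifier's output. This assumes the true label was stored as the instance's target by the pipe when the documents were loaded:
;; Count [expected actual] label pairs across the test set.
(def confusion
  (frequencies
    (map (fn [c]
           [(.. c getInstance getTarget toString)       ; expected
            (.. c getLabeling getBestLabel toString)])  ; actual
         classes)))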
How it works…
Naïve Bayesian classifiers work by starting with a reasonable guess about how likely a set of
features is to be marked as spam. Often, this might be 50/50. Then, as it sees more and
more documents and their classifications, it modifies this model, getting better results.
For example, it might notice that the word free is found in 100 ham emails but in 900
spam emails. This makes it a very strong indicator of spam, and the classiier will update its
expectations accordingly. It then combines all of the relevant probabilities from the features it
sees in a document in order to classify it one way or the other.
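To make that concrete, here is a sketch of Bayes' rule for that single feature. The corpus sizes are made up for illustration (1,000 ham and 1,000 spam emails, matching the 50/50 starting guess); only the 100 and 900 counts come from the example above:
;; P(spam | free) from P(free | spam), P(free | ham), and the prior.
(let [p-free-given-spam 900/1000  ; "free" appears in 900 of 1,000 spam
      p-free-given-ham  100/1000  ; "free" appears in 100 of 1,000 ham
      p-spam            1/2]      ; the 50/50 starting guess
  (/ (* p-free-given-spam p-spam)
     (+ (* p-free-given-spam p-spam)
        (* p-free-given-ham (- 1 p-spam)))))
;; => 9/10, so seeing "free" shifts the estimate strongly toward spam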