3. Finally, these two utility functions are relatively short and not strictly necessary, but they are good to have:
;; NaiveBayesTrainer comes from the cc.mallet.classify package.
(defn train [instance-list]
  (.train (NaiveBayesTrainer.) instance-list))
(defn classify [bayes instance-list]
  (.classify bayes instance-list))
Now, we can use these functions to load the training documents from the training
directory, train the classifier, and use it to classify the test files:
(def pipe (make-pipe-list))
(def instance-list (add-input-directory "training" pipe))
(def bayes (train instance-list))
Now, we can use it to classify the test files:
(def test-list (add-input-directory "test-data" pipe))
(def classes (classify bayes test-list))
Finding the results just takes digging into the class structure:
user=> (.. (first (seq classes)) getLabeling getBestLabel toString)
"hard_ham"
We can use this to construct a matrix that shows how the classifier performs, as follows:
                 Expected ham    Expected spam
Actually ham     246             99
Actually spam    4               402
From this confusion matrix, you can see that it does pretty well. Moreover, when it
errs, it errs on the side of misclassifying spam as ham. This is good because it means
that we'd only need to dig into our spam folder for four misfiled emails.
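As a sketch of how such a matrix could be tallied programmatically, we can compare each instance's expected label against the classifier's output. This assumes the true label was stored as the instance's target by the pipe when the documents were loaded:
;; Count [expected actual] label pairs across the test set.
(def confusion
  (frequencies
    (map (fn [c]
           [(.. c getInstance getTarget toString)       ; expected
            (.. c getLabeling getBestLabel toString)])  ; actual
         classes)))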
How it works…
Naïve Bayesian classifiers work by starting with a reasonable guess about how likely a set of
features is to be marked as spam. Often, this might be 50/50. Then, as it sees more and
more documents and their classifications, it modifies this model, getting better results.
For example, it might notice that the word free is found in 100 ham emails but in 900
spam emails. This makes it a very strong indicator of spam, and the classiier will update its
expectations accordingly. It then combines all of the relevant probabilities from the features it
sees in a document in order to classify it one way or the other.
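To make that concrete, here is a sketch of Bayes' rule for that single feature. The corpus sizes are made up for illustration (1,000 ham and 1,000 spam emails, matching the 50/50 starting guess); only the 100 and 900 counts come from the example above:
;; P(spam | free) from P(free | spam), P(free | ham), and the prior.
(let [p-free-given-spam 900/1000  ; "free" appears in 900 of 1,000 spam
      p-free-given-ham  100/1000  ; "free" appears in 100 of 1,000 ham
      p-spam            1/2]      ; the 50/50 starting guess
  (/ (* p-free-given-spam p-spam)
     (+ (* p-free-given-spam p-spam)
        (* p-free-given-ham (- 1 p-spam)))))
;; => 9/10, so seeing "free" shifts the estimate strongly toward spam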