Working with Unstructured and Textual Data - Clojure Data Analysis


For a more rigorous explanation, check out Mark Steyvers's introduction, Probabilistic Topic Models, which you can see at http://psiexp.ss.uci.edu/research/papers/

For some information on how to evaluate the topics that you get, see http://homepages.

Performing naïve Bayesian classification with MALLET

MALLET is best known as a library for topic modeling. However, it also implements a number of other algorithms.

One popular algorithm that MALLET implements is naïve Bayesian classification. If you have documents that are already divided into categories, you can train a classifier to categorize new documents into those same categories. Often, this works surprisingly well.

One common use for this is in spam e-mail detection. We'll use this as our example here too.
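To make the idea concrete before turning to MALLET, here is a minimal from-scratch sketch of a naïve Bayesian classifier in plain Clojure. It is only an illustration of the technique, not MALLET's implementation: the function names (tokenize, train, log-score, classify), the add-one smoothing, and the four toy messages are all assumptions made up for this example.

```clojure
(require '[clojure.string :as str])

(defn tokenize
  "Lowercases the text and splits it into word tokens."
  [text]
  (re-seq #"[a-z']+" (str/lower-case text)))

(defn train
  "Builds a model from a sequence of [label text] pairs by counting
  documents per label, tokens per label, and the overall vocabulary."
  [examples]
  (reduce (fn [model [label text]]
            (let [tokens (tokenize text)]
              (-> model
                  (update-in [:doc-counts label] (fnil inc 0))
                  (update-in [:word-counts label]
                             #(merge-with + % (frequencies tokens)))
                  (update-in [:vocab] into tokens))))
          {:doc-counts {} :word-counts {} :vocab #{}}
          examples))

(defn log-score
  "Returns log P(label) plus the sum of log P(token | label),
  using add-one (Laplace) smoothing for the token probabilities."
  [{:keys [doc-counts word-counts vocab]} label tokens]
  (let [total-docs (reduce + (vals doc-counts))
        counts     (get word-counts label {})
        denom      (+ (reduce + (vals counts)) (count vocab))]
    (reduce +
            (Math/log (/ (double (doc-counts label)) total-docs))
            (map #(Math/log (/ (+ 1.0 (get counts % 0)) denom))
                 tokens))))

(defn classify
  "Returns the label with the highest log score for the text."
  [model text]
  (let [tokens (tokenize text)]
    (apply max-key #(log-score model % tokens)
           (keys (:doc-counts model)))))

;; Toy training data, invented for illustration only.
(def model
  (train [[:spam "win cash now"]
          [:spam "cheap pills win big"]
          [:ham  "lunch meeting at noon"]
          [:ham  "notes from the project meeting"]]))

(classify model "win cheap cash")         ;; => :spam
(classify model "project meeting notes")  ;; => :ham
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing keeps a single unseen word from driving a category's probability to zero.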

Getting ready

We'll need to have MALLET included in our project.clj file:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [cc.mallet/mallet "2.0.7"]])

Just as in the Performing topic modeling with MALLET recipe, the list of classes to be included is a little long, but most of them are for the processing pipeline, as shown here:

(require '[clojure.java.io :as io])
(import [cc.mallet.types InstanceList]
        ;; Pipes that turn raw files into labeled feature vectors.
        [cc.mallet.pipe
         Input2CharSequence TokenSequenceLowercase
         CharSequence2TokenSequence SerialPipes
         SaveDataInSource Target2Label
         TokenSequence2FeatureSequence
         TokenSequenceRemoveStopwords
         FeatureSequence2AugmentableFeatureVector]
        ;; Walks a directory tree and yields one instance per file.
        [cc.mallet.pipe.iterator FileIterator]
        ;; Trains the naïve Bayesian classifier.
        [cc.mallet.classify NaiveBayesTrainer])
