For a more rigorous explanation, check out Mark Steyvers's introduction, Probabilistic Topic
Models, which you can see at
http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted.pdf
For some information on how to evaluate the topics that you get, see
http://homepages.inf.ed.ac.uk/imurray2/pub/09etm
Performing naïve Bayesian classification with MALLET
MALLET is best known as a library for topic modeling. However, it also implements a number
of other algorithms.
One popular algorithm that MALLET implements is naïve Bayesian classification. If you have
documents that are already divided into categories, you can train a classifier to categorize
new documents into those same categories. Often, this works surprisingly well.
One common use for this is in spam e-mail detection. We'll use this as our example here too.
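To make the idea concrete, here is a minimal sketch of naïve Bayesian classification in plain Clojure, applied to a toy spam example. It is not MALLET's implementation: the function names (tokenize, train, log-score, classify), the four training messages, and the use of add-one smoothing are all assumptions made for illustration.

```clojure
(require '[clojure.string :as str])

(defn tokenize
  "Lowercases text and splits it into word tokens."
  [text]
  (re-seq #"[a-z']+" (str/lower-case text)))

(defn train
  "Builds a model from [label text] pairs: document counts per label,
  token counts per label, and the overall vocabulary."
  [labeled-docs]
  (reduce (fn [model [label text]]
            (let [tokens (tokenize text)]
              (-> model
                  (update-in [:doc-counts label] (fnil inc 0))
                  (update-in [:token-counts label]
                             #(merge-with + % (frequencies tokens)))
                  (update-in [:vocab] into tokens))))
          {:doc-counts {} :token-counts {} :vocab #{}}
          labeled-docs))

(defn log-score
  "Returns log P(label) plus the sum of log P(token | label). Add-one
  smoothing keeps unseen tokens from zeroing out the score."
  [{:keys [doc-counts token-counts vocab]} label tokens]
  (let [total-docs (reduce + (vals doc-counts))
        counts     (get token-counts label {})
        denom      (+ (reduce + (vals counts)) (count vocab))]
    (reduce +
            (Math/log (/ (doc-counts label) (double total-docs)))
            (map #(Math/log (/ (inc (get counts % 0)) (double denom)))
                 tokens))))

(defn classify
  "Returns the label with the highest log score for the text."
  [model text]
  (let [tokens (tokenize text)]
    (apply max-key #(log-score model % tokens)
           (keys (:doc-counts model)))))

;; Hypothetical training data: two spam and two non-spam (ham) messages.
(def model
  (train [[:spam "win cash now"]
          [:spam "cheap pills win big"]
          [:ham  "meeting notes attached"]
          [:ham  "lunch at noon tomorrow"]]))

(classify model "win cheap cash")          ;; => :spam
(classify model "notes from the meeting")  ;; => :ham
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together. A real corpus would also need the tokenization and stop-word handling that MALLET's processing pipeline provides.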
Getting ready
We'll need to have MALLET included in our project.clj file:
(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [cc.mallet/mallet "2.0.7"]])
Just as in the Performing topic modeling with MALLET recipe, the list of classes to be included
is a little long, but most of them are for the processing pipeline, as shown here:
(require '[clojure.java.io :as io])
(import [cc.mallet.types InstanceList]
        ;; Pipes that make up the document-processing pipeline
        [cc.mallet.pipe
         Input2CharSequence TokenSequenceLowercase
         CharSequence2TokenSequence SerialPipes
         SaveDataInSource Target2Label
         TokenSequence2FeatureSequence
         TokenSequenceRemoveStopwords
         FeatureSequence2AugmentableFeatureVector]
        ;; Iterates over the input files
        [cc.mallet.pipe.iterator FileIterator]
        ;; Trains the classifier
        [cc.mallet.classify NaiveBayesTrainer])
 