How to do it…
We'll need to work the documents through several phases to perform topic modeling,
as follows:
1. Before we can process any documents, we'll need to create a processing pipeline. This defines how the documents should be read, tokenized, normalized, and so on (the MALLET imports these classes need are sketched after the steps):
(defn make-pipe-list []
  (InstanceList.
   (SerialPipes.
    [;; Read each file in as a single character sequence.
     (Input2CharSequence. "UTF-8")
     ;; Tokenize on runs of letters that may contain
     ;; internal punctuation (e.g., apostrophes).
     (CharSequence2TokenSequence. #"\p{L}[\p{L}\p{P}]+\p{L}")
     ;; Lowercase, remove English stop words, and convert the
     ;; tokens into the feature sequences the model expects.
     (TokenSequenceLowercase.)
     (TokenSequenceRemoveStopwords. false false)
     (TokenSequence2FeatureSequence.)])))
2. Now, we'll create a function that takes the processing pipeline and a directory of data files, and it will run the files through the pipeline. This loads the documents into the InstanceList, which is a collection of documents along with their metadata:
(defn add-directory-files [instance-list corpus-dir]
  (.addThruPipe
   instance-list
   (FileListIterator.
    (.listFiles (io/file corpus-dir))
    ;; Accept every file in the directory.
    (reify FileFilter
      (accept [this pathname] true))
    ;; Group 1 of this pattern (the file's base name) is used
    ;; as each instance's target label.
    #"/([^/]*).txt$"
    true)))
3. The last function takes the InstanceList and some other parameters and trains a topic model, which it returns (an end-to-end usage sketch follows these steps):
(defn train-model
  ;; By default, train 100 topics with 4 threads for 50 iterations.
  ([instances] (train-model 100 4 50 instances))
  ([num-topics num-threads num-iterations instances]
   ;; 1.0 and 0.01 are the alpha-sum and beta hyperparameters.
   (doto (ParallelTopicModel. num-topics 1.0 0.01)
     (.addInstances instances)
     (.setNumThreads num-threads)
     (.setNumIterations num-iterations)
     (.estimate))))
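These functions assume that the MALLET and Java classes involved have been imported and that clojure.java.io is aliased to io. As a sketch, the namespace setup might look like the following (the exact form will depend on how your own namespace is declared):
(import '[cc.mallet.types InstanceList]
        '[cc.mallet.pipe Input2CharSequence
          CharSequence2TokenSequence TokenSequenceLowercase
          TokenSequenceRemoveStopwords
          TokenSequence2FeatureSequence SerialPipes]
        '[cc.mallet.pipe.iterator FileListIterator]
        '[cc.mallet.topics ParallelTopicModel]
        '[java.io FileFilter])
(require '[clojure.java.io :as io]
         '[clojure.string :as str])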
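With those pieces in place, an end-to-end run at the REPL might look like this; data/corpus is only a placeholder for wherever your directory of plain-text files lives:
;; Build the pipeline and its backing InstanceList.
(def instances (make-pipe-list))
;; Push every file in the corpus directory through the pipeline.
(add-directory-files instances "data/corpus")   ; placeholder path
;; Train with the defaults: 100 topics, 4 threads, 50 iterations.
(def model (train-model instances))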
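Once estimation finishes, you can peek at what the model found. ParallelTopicModel's getTopWords method returns an array of word arrays, one per topic; the formatting below is just one way to print it:
;; Print the ten highest-weighted words for each topic.
(doseq [[topic-n words] (map-indexed vector (.getTopWords model 10))]
  (println topic-n (str/join " " words)))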
 