How to do it…
We'll need to work the documents through several phases to perform topic modeling,
as follows:
1. Before we can process any documents, we'll need to create a processing pipeline. This defines how the documents should be read, tokenized, normalized, and so on (the MALLET imports these classes need are sketched after the steps):
(defn make-pipe-list []
  (InstanceList.
   (SerialPipes.
    [;; Read each file in as a single character sequence.
     (Input2CharSequence. "UTF-8")
     ;; Tokenize on runs of letters that may contain
     ;; internal punctuation (e.g., apostrophes).
     (CharSequence2TokenSequence. #"\p{L}[\p{L}\p{P}]+\p{L}")
     ;; Lowercase, remove English stop words, and convert the
     ;; tokens into the feature sequences the model expects.
     (TokenSequenceLowercase.)
     (TokenSequenceRemoveStopwords. false false)
     (TokenSequence2FeatureSequence.)])))
2. Now, we'll create a function that takes the processing pipeline and a directory of data files, and it will run the files through the pipeline. This loads the documents into the InstanceList, which is a collection of documents along with their metadata:
(defn add-directory-files [instance-list corpus-dir]
  (.addThruPipe
   instance-list
   (FileListIterator.
    (.listFiles (io/file corpus-dir))
    ;; Accept every file in the directory.
    (reify FileFilter
      (accept [this pathname] true))
    ;; Group 1 of this pattern (the file's base name) is used
    ;; as each instance's target label.
    #"/([^/]*).txt$"
    true)))
3. The last function takes the InstanceList and some other parameters and trains a topic model, which it returns (an end-to-end usage sketch follows these steps):
(defn train-model
  ;; By default, train 100 topics with 4 threads for 50 iterations.
  ([instances] (train-model 100 4 50 instances))
  ([num-topics num-threads num-iterations instances]
   ;; 1.0 and 0.01 are the alpha-sum and beta hyperparameters.
   (doto (ParallelTopicModel. num-topics 1.0 0.01)
     (.addInstances instances)
     (.setNumThreads num-threads)
     (.setNumIterations num-iterations)
     (.estimate))))
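These functions assume that the MALLET and Java classes involved have been imported and that clojure.java.io is aliased to io. As a sketch, the namespace setup might look like the following (the exact form will depend on how your own namespace is declared):
(import '[cc.mallet.types InstanceList]
        '[cc.mallet.pipe Input2CharSequence
          CharSequence2TokenSequence TokenSequenceLowercase
          TokenSequenceRemoveStopwords
          TokenSequence2FeatureSequence SerialPipes]
        '[cc.mallet.pipe.iterator FileListIterator]
        '[cc.mallet.topics ParallelTopicModel]
        '[java.io FileFilter])
(require '[clojure.java.io :as io]
         '[clojure.string :as str])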
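With those pieces in place, an end-to-end run at the REPL might look like this; data/corpus is only a placeholder for wherever your directory of plain-text files lives:
;; Build the pipeline and its backing InstanceList.
(def instances (make-pipe-list))
;; Push every file in the corpus directory through the pipeline.
(add-directory-files instances "data/corpus")   ; placeholder path
;; Train with the defaults: 100 topics, 4 threads, 50 iterations.
(def model (train-model instances))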
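Once estimation finishes, you can peek at what the model found. ParallelTopicModel's getTopWords method returns an array of word arrays, one per topic; the formatting below is just one way to print it:
;; Print the ten highest-weighted words for each topic.
(doseq [[topic-n words] (map-indexed vector (.getTopWords model 10))]
  (println topic-n (str/join " " words)))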
 