Working with Unstructured and Textual Data - Clojure Data Analysis

Performing topic modeling with MALLET

Previously in this chapter, we looked at a number of ways to programmatically see what's present in documents. We saw how to identify people, places, dates, and other things in documents. We saw how to break things up into sentences.

Another, more sophisticated way to discover what's in a document is to use topic modeling. Topic modeling attempts to identify a set of topics that are contained in the document collection. Each topic is a cluster of words that are used together throughout the corpus. These clusters are found in individual documents to varying degrees, and a document is composed of several topics to varying extents. We'll take a look at this in more detail in the explanation for this recipe.
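To make the idea concrete before we bring in MALLET, here is a toy sketch in plain Clojure. The numbers are hand-written for illustration and are not output from any model; the names topics, doc-topics, dominant-topic, and word-weight are hypothetical and are not part of MALLET or of this recipe.

```clojure
;; Toy illustration only: hand-written numbers, not output from MALLET.
;; Each topic maps words to weights; a higher weight means the word is
;; more characteristic of that topic.
(def topics
  {:economy {"tax" 0.4 "jobs" 0.35 "budget" 0.25}
   :defense {"war" 0.5 "troops" 0.3 "peace" 0.2}})

;; Each document is a mixture of topics; the proportions sum to 1.
(def doc-topics
  {"speech-a" {:economy 0.8 :defense 0.2}
   "speech-b" {:economy 0.3 :defense 0.7}})

(defn dominant-topic
  "Returns the topic keyword with the largest share in the document."
  [doc-name]
  (key (apply max-key val (get doc-topics doc-name))))

(defn word-weight
  "Weight of a word in a document: the sum over topics of
  (topic share in the document) * (word weight in the topic)."
  [doc-name word]
  (reduce + (for [[topic share] (get doc-topics doc-name)]
              (* share (get-in topics [topic word] 0.0)))))
```

Here, (dominant-topic "speech-b") picks out :defense, while "speech-a" still carries a smaller share of that topic. A real topic model infers both maps from the corpus instead of having them written by hand.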

To perform topic modeling, we'll use MALLET (http://mallet.cs.umass.edu/). This is a library and utility that implements topic modeling in addition to several other document classification algorithms.

Getting ready

For this recipe, we'll need these lines in our project.clj file:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [cc.mallet/mallet "2.0.7"]])

Our imports and requirements for this are pretty extensive too, as shown here:

(require '[clojure.java.io :as io])

;; Clojure's import has no wildcard form, so every class is named explicitly.
(import [cc.mallet.types InstanceList]
        [cc.mallet.pipe
         Input2CharSequence TokenSequenceLowercase
         CharSequence2TokenSequence SerialPipes
         TokenSequenceRemoveStopwords
         TokenSequence2FeatureSequence]
        [cc.mallet.pipe.iterator FileListIterator]
        [cc.mallet.topics ParallelTopicModel]
        [java.io FileFilter])

Again, we'll use the State of the Union addresses that we've already seen several times in this chapter. You can download these from http://www.ericrochester.com/clj-data-analysis/data/sotu.tar.gz. I've unpacked the data from this file into the sotu directory.
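Before going further, it can help to confirm that the data landed where we expect. The following is a minimal sketch and not part of the original recipe: list-data-files is a hypothetical helper name, and the usage line assumes the archive was unpacked into a sotu directory under the current working directory.

```clojure
(require '[clojure.java.io :as io])

(defn list-data-files
  "Returns a sorted vector of the names of the regular files directly
  inside dir-name, or an empty vector if it isn't a directory."
  [dir-name]
  (let [dir (io/file dir-name)]
    (if (.isDirectory dir)
      (->> (.listFiles dir)
           (filter #(.isFile ^java.io.File %))
           (map #(.getName ^java.io.File %))
           sort
           vec)
      [])))

;; Hypothetical usage, assuming the archive was unpacked into ./sotu:
(count (list-data-files "sotu"))
```

If the count comes back as zero, the directory is probably missing or the archive was unpacked somewhere else.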
