How it works…
When you glance at the results, it appears to have performed well. To be certain, of course, we need to look into the document and see what it missed.
The process for using this is similar to the tokenizer or sentence chunker: load the model from a file and then call the result as a function.
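A quick sketch of that load-then-call pattern, using clojure-opennlp. The specific constructor and model path here are illustrative assumptions; substitute whichever pretrained model the earlier recipe loaded:

```clojure
(require '[opennlp.nlp :as nlp])

;; Load a pretrained model from a file. The path is illustrative.
;; clojure-opennlp wraps the loaded model in a plain Clojure function.
(def find-names
  (nlp/make-name-finder "models/en-ner-person.bin"))

;; Call the result as a function on a sequence of tokens.
(find-names ["Mr." "Smith" "went" "to" "Washington" "."])
```

The same pattern applies to `nlp/make-tokenizer`, `nlp/make-sentence-detector`, and `nlp/make-pos-tagger`: each takes a model file and returns a callable.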
Mapping documents to a sparse vector space representation
Many text algorithms deal with vector space representations of the documents. This means that the documents are normalized into vectors. Each individual token type is assigned one position across all the documents' vectors. For instance, the word "text" might have position 42, so index 42 in all the document vectors will have the frequency (or other value) of the word "text".
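As a minimal sketch of that mapping (the documents and helper names here are illustrative, not part of the recipe), each token type gets one fixed index, and each document becomes a vector of frequencies at those indices:

```clojure
;; Two tiny tokenized documents, purely for illustration.
(def docs [["the" "cat" "sat"]
           ["the" "dog" "sat" "sat"]])

;; Assign every token type one position across all documents.
(def token->index
  (into {} (map-indexed (fn [i t] [t i])
                        (distinct (apply concat docs)))))

;; Turn one tokenized document into a frequency vector.
(defn ->freq-vector [tokens]
  (reduce (fn [v t] (update v (token->index t) inc))
          (vec (repeat (count token->index) 0))
          tokens))

(map ->freq-vector docs)
;; => ([1 1 1 0] [1 0 2 1])
```

Note that even in this toy example most positions hold zero, which is what motivates the sparse representation discussed next.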
However, most documents won't have anything for most words. This makes them sparse vectors, and we can use more efficient formats for them. The Colt library provides implementations of sparse vectors. For this recipe, we'll see how to read a collection of documents into these.
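As a small sketch of what Colt's sparse format looks like (the dimensions and values here are arbitrary; it assumes the colt dependency from the Getting ready section), `DoubleFactory2D/sparse` builds matrices that only store their nonzero cells:

```clojure
(import '[cern.colt.matrix DoubleFactory2D])

;; A sparse matrix for, say, 2 documents over 10 token types.
(def m (.make DoubleFactory2D/sparse 2 10))

(.set m 0 3 2.0)   ; document 0 has frequency 2.0 at token index 3
(.get m 0 3)       ; => 2.0
(.cardinality m)   ; number of nonzero cells stored => 1
```

Only the single cell we set takes up storage; the other nineteen zeros cost nothing.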
Getting ready…
For this recipe, we'll need the following in our project.clj file:
(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
:dependencies [[org.clojure/clojure "1.6.0"]
[clojure-opennlp "0.3.2"]
[colt/colt "1.2.0"]])
For our script or REPL, we'll need these libraries:
(require '[clojure.set :as set]
'[opennlp.nlp :as nlp])
(import [cern.colt.matrix DoubleFactory2D])
From the previous recipes, we'll use several functions. From the Tokenizing text recipe, we'll use tokenize and normalize, and from the Scaling document frequencies with TF-IDF recipe, we'll use get-corpus-terms.
For the data, we'll again use the State of the Union addresses that we first saw in the Scaling document frequencies with TF-IDF recipe. You can download these from http://www.ericrochester.com/clj-data-analysis/data/sotu.tar.gz. I've unpacked the data from this file into the sotu directory.