How it works…
When you glance at the results, it appears to have performed well. To be certain, of course, we
need to look through the document and see what it missed.
The process for using this is similar to the tokenizer or sentence chunker: load the model from a
file and then call the result as a function.
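For example, assuming the model in question is an OpenNLP named-entity model (the exact model file depends on the preceding recipe, and the path here is only illustrative), the load-then-call pattern might look like this:
(require '[opennlp.nlp :as nlp])

;; Load the model from a file; make-name-finder returns a function.
(def find-names
  (nlp/make-name-finder "models/en-ner-person.bin"))

;; Call the result as a function over a sequence of tokens.
(find-names ["Senator" "Smith" "spoke" "first" "."])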
Mapping documents to a sparse vector space representation
Many text algorithms deal with vector space representations of documents. This means
that the documents are normalized into vectors. Each individual token type is assigned one
position across all of the documents' vectors. For instance, the word text might be assigned
position 42, so index 42 in every document vector will hold the frequency (or some other value) of text.
However, most documents won't have entries for most words. This makes them sparse
vectors, and we can use more efficient formats for them.
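As a quick illustration of that idea (this snippet is not part of the recipe's code; docs and vocab-index are names used only for this example), we can assign each token type one column index that is shared by every document:
;; Two toy documents, already tokenized.
(def docs [["the" "cat" "sat"]
           ["the" "dog" "sat" "down"]])

;; Assign each distinct token type one position across all documents.
(def vocab-index
  (into {}
        (map-indexed (fn [i token] [token i])
                     (distinct (mapcat identity docs)))))

;; Both documents use the same index for the same word.
(vocab-index "sat")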
The Colt library (http://acs.lbl.gov/ACSSoftware/colt/) contains implementations
of sparse vectors. For this recipe, we'll see how to read a collection of documents into these.
Getting ready…
For this recipe, we'll need the following in our project.clj file:
(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]
                 [colt/colt "1.2.0"]])
For our script or REPL, we'll need these libraries:
(require '[clojure.set :as set]
         '[opennlp.nlp :as nlp])
(import [cern.colt.matrix DoubleFactory2D])
From the previous recipes, we'll use several functions. From the Tokenizing text recipe, we'll
use tokenize and normalize, and from the Scaling document frequencies with TF-IDF
recipe, we'll use get-corpus-terms.
For the data, we'll again use the State of the Union addresses that we first saw in the Scaling
document frequencies with TF-IDF recipe. You can download these from
http://www.ericrochester.com/clj-data-analysis/data/sotu.tar.gz. I've unpacked the
data from this file into the sotu directory.
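With those pieces in place, Colt's sparse factory can hold document-term counts. The following is only a rough sketch of that structure, not the recipe's own code; doc-freqs here is a made-up sequence of token-frequency maps, one per document:
;; A hypothetical per-document frequency table.
(def doc-freqs [{"text" 3, "data" 1} {"data" 2}])

;; Assign each token type a column and fill a sparse matrix,
;; one row per document.
(let [vocab  (vec (distinct (mapcat keys doc-freqs)))
      index  (zipmap vocab (range))
      matrix (.make DoubleFactory2D/sparse
                    (count doc-freqs) (count vocab))]
  (doseq [[row freqs] (map-indexed vector doc-freqs)
          [token n]   freqs]
    (.set matrix row (index token) (double n)))
  matrix)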
 