How it works…
When you glance at the results, it appears to have performed well. To be certain, of course, we need to look into the document and see what it missed.
The process for using this is similar to the tokenizer or sentence chunker: load the model from a file and then call the result as a function.
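A quick sketch of that load-then-call pattern, using clojure-opennlp. The specific constructor and model path here are illustrative assumptions; substitute whichever pretrained model the earlier recipe loaded:

```clojure
(require '[opennlp.nlp :as nlp])

;; Load a pretrained model from a file. The path is illustrative.
;; clojure-opennlp wraps the loaded model in a plain Clojure function.
(def find-names
  (nlp/make-name-finder "models/en-ner-person.bin"))

;; Call the result as a function on a sequence of tokens.
(find-names ["Mr." "Smith" "went" "to" "Washington" "."])
```

The same pattern applies to `nlp/make-tokenizer`, `nlp/make-sentence-detector`, and `nlp/make-pos-tagger`: each takes a model file and returns a callable.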
Mapping documents to a sparse vector space representation
Many text algorithms deal with vector space representations of the documents. This means that the documents are normalized into vectors. Each individual token type is assigned one position across all the documents' vectors. For instance, the word "text" might have position 42, so index 42 in all the document vectors will have the frequency (or other value) of the word "text".
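As a minimal sketch of that mapping (the documents and helper names here are illustrative, not part of the recipe), each token type gets one fixed index, and each document becomes a vector of frequencies at those indices:

```clojure
;; Two tiny tokenized documents, purely for illustration.
(def docs [["the" "cat" "sat"]
           ["the" "dog" "sat" "sat"]])

;; Assign every token type one position across all documents.
(def token->index
  (into {} (map-indexed (fn [i t] [t i])
                        (distinct (apply concat docs)))))

;; Turn one tokenized document into a frequency vector.
(defn ->freq-vector [tokens]
  (reduce (fn [v t] (update v (token->index t) inc))
          (vec (repeat (count token->index) 0))
          tokens))

(map ->freq-vector docs)
;; => ([1 1 1 0] [1 0 2 1])
```

Note that even in this toy example most positions hold zero, which is what motivates the sparse representation discussed next.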
However, most documents won't have anything for most words. This makes them sparse vectors, and we can use more efficient formats for them. The Colt library provides implementations of sparse vectors. For this recipe, we'll see how to read a collection of documents into these.
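As a small sketch of what Colt's sparse format looks like (the dimensions and values here are arbitrary; it assumes the colt dependency from the Getting ready section), `DoubleFactory2D/sparse` builds matrices that only store their nonzero cells:

```clojure
(import '[cern.colt.matrix DoubleFactory2D])

;; A sparse matrix for, say, 2 documents over 10 token types.
(def m (.make DoubleFactory2D/sparse 2 10))

(.set m 0 3 2.0)   ; document 0 has frequency 2.0 at token index 3
(.get m 0 3)       ; => 2.0
(.cardinality m)   ; number of nonzero cells stored => 1
```

Only the single cell we set takes up storage; the other nineteen zeros cost nothing.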
Getting ready…
For this recipe, we'll need the following in our project.clj file:
(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
:dependencies [[org.clojure/clojure "1.6.0"]
[clojure-opennlp "0.3.2"]
[colt/colt "1.2.0"]])
For our script or REPL, we'll need these libraries:
(require '[clojure.set :as set]
'[opennlp.nlp :as nlp])
(import [cern.colt.matrix DoubleFactory2D])
From the previous recipes, we'll use several functions. From the Tokenizing text recipe, we'll use tokenize and normalize, and from the Scaling document frequencies with TF-IDF recipe, we'll use get-corpus-terms.
For the data, we'll again use the State of the Union addresses that we first saw in the Scaling document frequencies with TF-IDF recipe. You can download these from http://www.ericrochester.com/clj-data-analysis/data/sotu.tar.gz. I've unpacked the data from this file into the sotu directory.