[comprehending 1.8353230604578765 1]
[novelty 1.8353230604578765 1]
[well-digested 1.8353230604578765 1]
[cherishing 1.8353230604578765 1]
[cool 1.7574531294111173 1]
You can see that these words all occur once in this document. In fact, intimating and
licentiousness are found only in the first SOTU, and all 10 of these words are found in six or
fewer addresses.
Finding people, places, and things with Named Entity Recognition
One thing that's fairly easy to pull out of documents is named items, such as people's names,
organizations, locations, and dates. The algorithms that do this are called Named
Entity Recognition (NER), and while they are not perfect, they're generally pretty good. Error
rates under 0.1 are normal.
The OpenNLP library has classes to perform NER, and depending on what you train them with,
they will identify people, locations, dates, or a number of other things. The clojure-opennlp
library also exposes these classes in a good, Clojure-friendly way.
Getting ready
We'll continue building on the previous recipes in this chapter. Because of this, we'll use the
same project.clj file:
(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]])
From the Tokenizing text recipe, we'll use tokenize, and from the Focusing on content words
with stoplists recipe, we'll use normalize.
Pretrained models can be downloaded from http://opennlp.sourceforge.net/models-1.5/.
I downloaded en-ner-person.bin, en-ner-organization.bin, en-ner-date.bin,
en-ner-location.bin, and en-ner-money.bin. Then, I saved these models in models/.
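Before moving on, it can help to confirm that all five model files actually landed in models/. The following is only a minimal sketch using clojure.java.io. The names ner-model-files and missing-models are hypothetical helpers introduced here for illustration, not part of clojure-opennlp, and the sketch assumes you run it from the project root.

```clojure
(require '[clojure.java.io :as io])

;; The five pretrained model files named in the text above.
(def ner-model-files
  ["en-ner-person.bin"
   "en-ner-organization.bin"
   "en-ner-date.bin"
   "en-ner-location.bin"
   "en-ner-money.bin"])

(defn missing-models
  "Hypothetical helper: returns a vector of the expected model file
  names that do not exist under the directory dir."
  [dir]
  (vec (remove #(.exists (io/file dir %)) ner-model-files)))

;; An empty vector means every model is in place.
(missing-models "models")
```

If the call returns any file names, download those models again before loading them.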
 