Working with Unstructured and Textual Data - Clojure Data Analysis


[comprehending 1.8353230604578765 1]

[novelty 1.8353230604578765 1]

[well-digested 1.8353230604578765 1]

[cherishing 1.8353230604578765 1]

[cool 1.7574531294111173 1]

You can see that these words all occur once in this document, and in fact, intimating and

licentiousness are only found in the first SOTU, and all 10 of these words are found in six or

fewer addresses.
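The claim about how many addresses contain each word can be checked with a document-frequency count. The following is a minimal sketch, assuming each address has already been turned into a sequence of normalized tokens. The names doc-frequency and rare-tokens and the sample data are hypothetical and are not taken from the earlier recipes.

```clojure
(defn doc-frequency
  "Maps each distinct token to the number of documents it appears in.
  docs is a collection of token sequences, one per document."
  [docs]
  ;; distinct ensures a token counts at most once per document
  (frequencies (mapcat distinct docs)))

(defn rare-tokens
  "Returns the set of tokens found in max-docs or fewer documents."
  [docs max-docs]
  (set (for [[token n] (doc-frequency docs)
             :when (<= n max-docs)]
         token)))

;; Toy stand-in for the tokenized addresses.
(def docs [["cool" "novelty" "union"]
           ["union" "cool" "union"]
           ["union" "war"]])

(doc-frequency docs)
(rare-tokens docs 1)
```

With the real corpus, calling (rare-tokens docs 6) would list the words found in six or fewer addresses.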

Finding people, places, and things with Named Entity Recognition

One thing that's fairly easy to pull out of documents is named items. This includes things such
as people's names, organizations, locations, and dates. The task of finding these items is called
Named Entity Recognition (NER), and while the algorithms for it are not perfect, they're generally
pretty good. Error rates under 0.1 are normal.

The OpenNLP library has classes to perform NER, and depending on what you train them with,

they will identify people, locations, dates, or a number of other things. The clojure-opennlp

library also exposes these classes in a good, Clojure-friendly way.

Getting ready

We'll continue building on the previous recipes in this chapter. Because of this, we'll use the

same project.clj file:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]])

From the Tokenizing text recipe, we'll use tokenize, and from the Focusing on content words
with stoplists recipe, we'll use normalize.
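For readers who do not have the earlier recipes at hand, the two functions might look roughly like the following. This is a sketch under assumptions: the tokenizer model path models/en-token.bin and the exact body of normalize are guesses at what the earlier recipes define, not quotations from them.

```clojure
(require '[opennlp.nlp :as nlp]
         '[clojure.string :as string])

;; Assumption: the English tokenizer model from the earlier recipe
;; was saved at this path.
(def tokenize (nlp/make-tokenizer "models/en-token.bin"))

(defn normalize
  "Lower-cases every token in token-seq."
  [token-seq]
  (map string/lower-case token-seq))
```

A call such as (normalize (tokenize "Mr. Smith visited Boston.")) would then return a sequence of lower-cased token strings.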

Pretrained models can be downloaded from http://opennlp.sourceforge.net/models-1.5/.
I downloaded en-ner-person.bin, en-ner-organization.bin, en-ner-date.bin,
en-ner-location.bin, and en-ner-money.bin. Then, I saved these models in models/.
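With the models in place, each one can be loaded into its own finder function. The following is a minimal sketch using make-name-finder from clojure-opennlp. The var names get-persons and get-locations and the sample tokens are illustrative assumptions, and the paths assume the models were saved under models/ as described above.

```clojure
(require '[opennlp.nlp :as nlp])

;; Each call reads one pretrained model and returns a function
;; that finds that kind of entity in a sequence of tokens.
(def get-persons (nlp/make-name-finder "models/en-ner-person.bin"))
(def get-locations (nlp/make-name-finder "models/en-ner-location.bin"))

;; Name finders expect token strings, not raw text, so this input
;; is written out pre-tokenized to keep the sketch self-contained.
(def tokens ["My" "name" "is" "Lee" "," "not" "John" "."])

(get-persons tokens)
(get-locations tokens)
```

Each finder returns a sequence of the entity strings it recognized. What it finds depends on the model, so results on other text will vary.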
