Working with Unstructured and Textual Data - Clojure Data Analysis

Introduction

We've been talking about all of the data that's out there in the world. However, structured or
semistructured data, the kind you'd find in spreadsheets or in tables on web pages, is vastly
overshadowed by the unstructured data that's being produced. This includes news articles,
blog posts, tweets, Hacker News discussions, Stack Overflow questions and responses, and
any other natural text, which seems to be generated by the petabyte daily.

This unstructured content contains information. It has rich, subtle, and nuanced data, but
getting at it is difficult. In this chapter, we'll explore some ways to get some of the information
out of unstructured data. It won't be fully nuanced and it will be very rough, but it's a start.

We've already looked at how to acquire textual data. In Chapter 1, Importing Data for Analysis,
we covered this in the Scraping textual data from web pages recipe. The Web is still going to
be your best source for data.

Tokenizing text

Before we can do any real analysis of a text or a corpus of texts, we have to identify the
words in the text. This process is called tokenization. The output of this process is a list of
words, possibly including the punctuation in the text. This is different from tokenizing formal
languages such as programming languages: it is meant to work with natural languages, and its
results are less structured.

It's easy to write your own tokenizer, but there are a lot of edge and corner cases to
account for. It's also easy to include a natural language processing (NLP)
library that provides one or more tokenizers. In this recipe, we'll use OpenNLP
(http://opennlp.apache.org/) and its Clojure wrapper (https://clojars.org/clojure-opennlp).
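To make the edge cases concrete, here is a minimal sketch of a hand-rolled tokenizer. It is not the tokenizer used in this recipe; the function name naive-tokenize and the regular expression are assumptions chosen only for illustration. It treats every run of word characters as one token and every other non-whitespace character as its own token.

```clojure
;; A naive, regex-based tokenizer (an illustrative sketch, not OpenNLP).
;; `naive-tokenize` is a hypothetical name introduced for this example.
(defn naive-tokenize
  "Splits text into runs of word characters and single punctuation marks."
  [text]
  ;; \w+      matches a run of letters, digits, or underscores
  ;; [^\w\s]  matches one character that is neither word nor whitespace
  (vec (re-seq #"\w+|[^\w\s]" text)))

(naive-tokenize "Dr. Smith isn't here.")
;; => ["Dr" "." "Smith" "isn" "'" "t" "here" "."]
```

Notice that the abbreviation "Dr." is split from its period and the contraction "isn't" is broken into three pieces. Cases like these are the ones a tokenizer from an NLP library is meant to handle for us.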

Getting ready

We'll need to include clojure-opennlp in our project.clj file:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]])

We will also need to require it into the current namespace, as follows:

(require '[opennlp.nlp :as nlp])

Finally, we'll download a model for a statistical tokenizer. I downloaded all of the files from
http://opennlp.sourceforge.net/models-1.5/ and saved them into models/.
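Before going further, it can help to confirm that the model files actually landed where we expect. The following is a small sketch using only clojure.java.io. The helper name missing-models is hypothetical, and en-token.bin is an assumed filename for the English tokenizer model, so adjust it to match whatever you downloaded.

```clojure
(require '[clojure.java.io :as io])

;; `missing-models` is a hypothetical helper introduced for this example.
(defn missing-models
  "Returns the names in `expected` that are not regular files inside `dir`."
  [dir expected]
  (vec (remove #(.isFile ^java.io.File (io/file dir %)) expected)))

;; Assumption: the tokenizer model was saved as models/en-token.bin.
(missing-models "models" ["en-token.bin"])
```

An empty vector means every expected file is in place; any names that come back still need to be downloaded into models/.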
