Introduction
We've been talking about all of the data that's out there in the world. However, structured or
semistructured data (the kind you'd find in spreadsheets or in tables on web pages) is vastly
overshadowed by the unstructured data that's being produced. This includes news articles,
blog posts, tweets, Hacker News discussions, StackOverflow questions and responses, and
any other natural text, which seems to be generated by the petabyte daily.
This unstructured content contains information. It has rich, subtle, and nuanced data, but
getting it is difficult. In this chapter, we'll explore some ways to get some of the information
out of unstructured data. It won't be fully nuanced and it will be very rough, but it's a start.
We've already looked at how to acquire textual data. In Chapter 1, Importing Data for Analysis,
we looked at this in the Scraping textual data from web pages recipe. Still, the Web is going to
be your best source for data.
Tokenizing text
Before we can do any real analysis of a text or a corpus of texts, we have to identify the
words in the text. This process is called tokenization. Its output is a list of words, possibly
along with the punctuation that appears in the text. This is different from tokenizing formal
languages such as programming languages: it is meant to work with natural languages, and its
results are less structured.
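To make the idea concrete, here is a minimal sketch of a naive tokenizer built on a regular expression. The name naive-tokenize is a hypothetical helper introduced only for illustration; it is not part of OpenNLP, and it is not the tokenizer this recipe uses. It treats a token as either a run of word characters or a single punctuation character.

```clojure
;; Hypothetical helper for illustration only (not part of OpenNLP).
;; A token is either a run of word characters (\w+) or a single
;; character that is neither a word character nor whitespace.
(defn naive-tokenize
  [text]
  (vec (re-seq #"\w+|[^\w\s]" text)))

(naive-tokenize "Tokenizing text is easy, isn't it?")
;; => ["Tokenizing" "text" "is" "easy" "," "isn" "'" "t" "it" "?"]
```

Notice that the contraction isn't is broken into three tokens. A rule this simple has no notion of contractions, abbreviations, or hyphenated words.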
It's easy to write your own tokenizer, but there are a lot of edge and corner cases to
account for. It's also easy to include a natural language processing (NLP)
library that provides one or more tokenizers. In this recipe, we'll use OpenNLP
(http://opennlp.apache.org/) and its Clojure wrapper
(https://clojars.org/clojure-opennlp).
Getting ready
We'll need to include clojure-opennlp in our project.clj file:
(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]])
We will also need to require it into the current namespace, as follows:
(require '[opennlp.nlp :as nlp])
Finally, we'll download a model for a statistical tokenizer. I downloaded all of the files from
http://opennlp.sourceforge.net/models-1.5/ and saved them into models/.
 