Working with Unstructured and Textual Data - Clojure Data Analysis


On the other hand, chair has a meaning beyond what it's doing in the sentence, and in fact,

its role in the sentence will vary (subject, direct object, and so on).

You don't always want to use stopwords since they throw away information. However, since

function words are more frequent than content words, sometimes focusing on the content

words can add clarity to your analysis and its output. Removing stopwords can also speed up processing.
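To make the trade-off concrete, here is a minimal sketch that counts token frequencies before and after filtering. The names sample-stopwords and sample-tokens are hypothetical, and the tiny stopword set is hand-picked for illustration; it is not the NLTK list used later in this recipe.

```clojure
;; A tiny, hand-picked stopword set for illustration only (an assumption,
;; not the NLTK list that this recipe uses).
(def sample-stopwords #{"the" "is" "on" "and" "a"})

;; Tokens are assumed to be lowercased already.
(def sample-tokens
  ["the" "chair" "is" "on" "the" "rug" "and"
   "the" "cat" "is" "on" "the" "chair"])

;; Raw counts: the function words crowd the top of the list.
(def raw-counts (frequencies sample-tokens))

;; A set acts as a predicate on its members, so it can be passed to remove.
(def content-counts
  (frequencies (remove sample-stopwords sample-tokens)))

(println raw-counts)
(println content-counts)
```

In the raw counts, "the" appears four times and dominates; after filtering, only "chair", "rug", and "cat" remain. The cost is visible too: the counts no longer say anything about how the words were connected.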

Getting ready

This recipe will build on the work that we've done so far in this chapter. As such, it will use the

same project.clj file that we used in the Tokenizing text and Finding sentences recipes:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]])

However, we'll use a slightly different set of requirements for this recipe:

(require '[opennlp.nlp :as nlp]
         '[clojure.java.io :as io])

We'll also need to have a list of stopwords. You can easily create your own list, but for the

purpose of this recipe, we'll use the English stopword list included with the Natural Language

Toolkit (http://www.nltk.org/). You can download this from

http://nltk.github.com/nltk_data/packages/corpora/stopwords.zip. Unzip it into your project

directory and make sure that the stopwords/english file exists.
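One quick way to confirm that the file landed in the right place is to check for it from the REPL. This is only a sketch: stopwords-present? is a hypothetical helper, not part of the recipe, and it assumes the REPL was started from the project directory.

```clojure
(require '[clojure.java.io :as io])

;; Hypothetical helper: returns true if a file exists at the given path.
;; Relative paths are resolved against the JVM's working directory.
(defn stopwords-present? [path]
  (.exists (io/file path)))

;; Expected to be true once the archive has been unzipped as described.
(println (stopwords-present? "stopwords/english"))
```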

We'll also use the tokenize and get-sentences functions that we created in the previous

two recipes.

How to do it…

We'll need to create a function in order to process and normalize the tokens. Also, we'll need

a utility function to load the stopword list. Once these are in place, we'll see how to use the

stopwords. To do this, perform the following steps:

1. The words in the stopword list have been lowercased. We can also do this with the

tokens that we create. We'll use the normalize function to handle the lowercasing

of each token:

(defn normalize [token-seq]
  (map #(.toLowerCase %) token-seq))
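As a quick check, normalize can be tried on a hand-written vector of tokens. The definition is repeated so that the snippet stands alone, and the input tokens are made up for illustration.

```clojure
(defn normalize [token-seq]
  (map #(.toLowerCase %) token-seq))

;; map returns a lazy sequence, so the result prints as a list.
(println (normalize ["The" "Chair" "IS" "here"]))
;; prints (the chair is here)
```

Since .toLowerCase is a Java String method, every token must be a string. clojure.string/lower-case from the standard library could be used in its place.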
