2. The stoplist will actually be represented by a Clojure set. This will make filtering a lot
easier. The load-stopwords function will read in the file, break it into lines, and
fold them into a set, as follows:
(defn load-stopwords [filename]
  (with-open [r (io/reader filename)]
    (set (doall (line-seq r)))))

(def is-stopword (load-stopwords "stopwords/english"))
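Because a Clojure set is itself a function that looks up its argument, the set returned by load-stopwords can be called directly as the predicate that the next step passes to remove. Here is a minimal sketch of that behavior using a small hand-built stoplist (the words in demo-stopwords are illustrative, not the contents of the stopwords/english file):

```clojure
;; A Clojure set acts as a membership predicate: calling it with an
;; element returns that element when present, or nil when absent.
(def demo-stopwords #{"a" "the" "i" "to"})

(demo-stopwords "the")
;;=> "the" (truthy)

(demo-stopwords "cow")
;;=> nil (falsey)

;; This is why the set can be handed straight to remove.
(remove demo-stopwords ["i" "never" "saw" "a" "cow"])
;;=> ("never" "saw" "cow")
```

The same lookup behavior is what lets is-stopword be used without wrapping it in contains?.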
3. Finally, we can load the tokens. This will break the input into sentences. Then, it will
tokenize each sentence, normalize its tokens, and remove its stopwords, as follows:
(def tokens
  (map #(remove is-stopword (normalize (tokenize %)))
       (get-sentences
         "I never saw a Purple Cow.
          I never hope to see one.
          But I can tell you, anyhow.
          I'd rather see than be one.")))
Now, you can see that the tokens returned are more focused on the content and are missing
all of the function words:
user=> (pprint tokens)
(("never" "saw" "purple" "cow" ".")
("never" "hope" "see" "one" ".")
("tell" "," "anyhow" ".")
("'d" "rather" "see" "one" "."))
Getting document frequencies
One common and useful metric to work with text corpora is to get the counts of the tokens in
the documents. This can be done quite easily by leveraging standard Clojure functions.
Let's see how.
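As a sketch of the idea before the recipe proper, the standard frequencies function already turns one document's token sequence into a count map, and merge-with + combines those maps across a corpus. The token data below is illustrative, echoing the output of the previous recipe:

```clojure
;; Per-document token sequences, like those produced above.
(def doc-tokens
  [["never" "saw" "purple" "cow"]
   ["never" "hope" "see" "one"]])

;; frequencies maps each distinct token to its count in one document.
(frequencies (first doc-tokens))
;;=> {"never" 1, "saw" 1, "purple" 1, "cow" 1}

;; One frequency map per document.
(def doc-freqs (map frequencies doc-tokens))

;; Merging the maps with + gives corpus-wide token counts.
(apply merge-with + doc-freqs)
;;=> {"never" 2, "saw" 1, "purple" 1, "cow" 1,
;;    "hope" 1, "see" 1, "one" 1}
```

Everything here is in clojure.core, which is what makes this metric so cheap to compute.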
Getting ready
We'll continue building on the previous recipes in this chapter. Because of that, we'll use the
same project.clj ile:
(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]])
 