We'll also use tokenize, get-sentences, normalize, load-stopwords, and is-stopword from the earlier recipes, and we'll reuse the tokens value that we saw in the Focusing on content words with stoplists recipe. Here it is again:
(def tokens
  (map #(remove is-stopword (normalize (tokenize %)))
       (get-sentences
         "I never saw a Purple Cow.
          I never hope to see one.
          But I can tell you, anyhow.
          I'd rather see than be one.")))
How to do it…
Of course, the standard function to count items in a sequence is frequencies. We can use this to get the token counts for each sentence, but then we'll also want to fold those into a single frequency table using merge-with:
(def token-freqs
  (apply merge-with + (map frequencies tokens)))
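To make that folding step concrete, here is a small standalone illustration (not part of the recipe; the sentence-freqs maps are hand-written stand-ins for what (map frequencies tokens) produces). When the same token appears in more than one per-sentence map, merge-with resolves the collision by adding the counts with +:

```clojure
;; Stand-in per-sentence frequency maps, shaped like the output of
;; (map frequencies tokens) for the first two sentences of the poem.
(def sentence-freqs
  [{"never" 1, "saw" 1, "purple" 1, "cow" 1, "." 1}
   {"never" 1, "hope" 1, "see" 1, "one" 1, "." 1}])

;; merge-with applies + whenever a key occurs in both maps, so the
;; counts for shared tokens ("never" and ".") are summed.
(apply merge-with + sentence-freqs)
;; => {"never" 2, "saw" 1, "purple" 1, "cow" 1, "." 2,
;;     "hope" 1, "see" 1, "one" 1}   (entry order may vary)
```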
We can print or query this table to get the count for any token or piece of punctuation,
as follows:
user=> (pprint token-freqs)
{"see" 2,
 "purple" 1,
 "tell" 1,
 "cow" 1,
 "anyhow" 1,
 "hope" 1,
 "never" 2,
 "saw" 1,
 "'d" 1,
 "." 4,
 "one" 2,
 "," 1,
 "rather" 1}
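Since Clojure maps are functions of their keys, querying the table is direct. Here's a quick sketch using a hand-written stand-in for token-freqs (a plain map, so the snippet runs on its own):

```clojure
;; Stand-in for the recipe's token-freqs table.
(def token-freqs {"see" 2, "never" 2, "one" 2, "purple" 1})

;; Calling the map as a function looks up a token's count:
(token-freqs "see")             ;; => 2

;; get with a default handles tokens that never occurred:
(get token-freqs "unicorn" 0)   ;; => 0
```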