We'll also use tokenize, get-sentences, normalize, load-stopwords, and is-stopword from the earlier recipes, and we'll reuse the tokens value that we saw in the Focusing on content words with stoplists recipe. Here it is again:
(def tokens
  (map #(remove is-stopword (normalize (tokenize %)))
       (get-sentences
         "I never saw a Purple Cow.
          I never hope to see one.
          But I can tell you, anyhow.
          I'd rather see than be one.")))
How to do it…
Of course, the standard function to count items in a sequence is frequencies. We can use this to get the token counts for each sentence, but then we'll also want to fold those into a single frequency table using merge-with:
(def token-freqs
  (apply merge-with + (map frequencies tokens)))
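To make that folding step concrete, here is a small standalone illustration (not part of the recipe; the sentence-freqs maps are hand-written stand-ins for what (map frequencies tokens) produces). When the same token appears in more than one per-sentence map, merge-with resolves the collision by adding the counts with +:

```clojure
;; Stand-in per-sentence frequency maps, shaped like the output of
;; (map frequencies tokens) for the first two sentences of the poem.
(def sentence-freqs
  [{"never" 1, "saw" 1, "purple" 1, "cow" 1, "." 1}
   {"never" 1, "hope" 1, "see" 1, "one" 1, "." 1}])

;; merge-with applies + whenever a key occurs in both maps, so the
;; counts for shared tokens ("never" and ".") are summed.
(apply merge-with + sentence-freqs)
;; => {"never" 2, "saw" 1, "purple" 1, "cow" 1, "." 2,
;;     "hope" 1, "see" 1, "one" 1}   (entry order may vary)
```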
We can print or query this table to get the count for any token or piece of punctuation,
as follows:
user=> (pprint token-freqs)
{"see" 2,
 "purple" 1,
 "tell" 1,
 "cow" 1,
 "anyhow" 1,
 "hope" 1,
 "never" 2,
 "saw" 1,
 "'d" 1,
 "." 4,
 "one" 2,
 "," 1,
 "rather" 1}
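Since Clojure maps are functions of their keys, querying the table is direct. Here's a quick sketch using a hand-written stand-in for token-freqs (a plain map, so the snippet runs on its own):

```clojure
;; Stand-in for the recipe's token-freqs table.
(def token-freqs {"see" 2, "never" 2, "one" 2, "purple" 1})

;; Calling the map as a function looks up a token's count:
(token-freqs "see")             ;; => 2

;; get with a default handles tokens that never occurred:
(get token-freqs "unicorn" 0)   ;; => 0
```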