How to do it…
For this recipe, we'll go from preprocessing the data to performing the counts in parallel and
looking at the results.
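These snippets use java.io.File and call clojure.string through a string alias, so they assume a namespace declaration along these lines (the namespace name here is only a placeholder):

(ns parallel-counts.core   ; placeholder; use your project's namespace
  (:require [clojure.string :as string])
  (:import [java.io File])) 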
1. Let's get a sequence of the files to process:
(def input-files
  (filter #(.isFile %)
          (file-seq (File. "./data/brown"))))
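As a quick sanity check at the REPL, you can count the files found; the result depends on your local copy of the corpus:

(count input-files)
;; => the number of document files under ./data/brown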
2. Now, we'll define some references: finished will indicate whether processing is
done or not, total-docs and total-words will keep running totals, freqs will
map tokens to their frequencies over the whole corpus, and running-report is an agent
that contains the current state of the report for the term we're interested in:
(def finished (ref false))
(def total-docs (ref 0))
(def total-words (ref 0))
(def freqs (ref {}))
(def running-report
  (agent {:term nil,
          :frequency 0,
          :ratio 0.0}))
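The later steps update these references from multiple threads. As a rough sketch only (not the recipe's actual processing code), the coordinated refs are changed together inside a dosync transaction:

;; Hypothetical helper: merges one document's frequency map into the
;; shared totals in a single STM transaction.
(defn update-totals [doc-freqs]
  (dosync
    (alter total-docs inc)
    (alter total-words + (reduce + (vals doc-freqs)))
    (alter freqs #(merge-with + % doc-freqs))))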
3. Let's write the tokenizer. The text in the Brown corpus files looks like this:
The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd
Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj
primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/''
that/cs any/dti irregularities/nns took/vbd place/nn ./.
We're not interested in the parts of speech, so our tokenizer will remove them and
convert each token to a lowercase keyword:
(defn tokenize-brown [input-str]
  (->> (string/split input-str #"\s+")
       (map #(first (string/split % #"/" 2)))
       (filter #(> (count %) 0))
       (map string/lower-case)
       (map keyword)))
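For example, running the tokenizer over the first few tokens of the sample line above strips the part-of-speech tags and yields lowercase keywords:

(tokenize-brown "The/at Fulton/np-tl County/nn-tl Grand/jj-tl")
;; => (:the :fulton :county :grand)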
4. Now, let's write a utility function that increments the frequency map for a token:
(defn accum-freq [m token]
  (assoc m token (inc (m token 0))))
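Reducing a token sequence with accum-freq builds up a frequency map. For example:

(reduce accum-freq {} [:the :jury :said :the])
;; => {:the 2, :jury 1, :said 1}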
 