Now, we have all of the pieces in place to use this.
1. For this example, we'll read all of the State of the Union addresses into a sequence of raw frequency hashmaps. This will be bound to the name corpus :

(def corpus
  (->> "sotu"
       (java.io.File.)
       (.list)
       (map #(str "sotu/" %))
       (map slurp)
       (map tokenize)
       (map normalize)
       (map frequencies)))
2. We'll use these frequencies to create the IDF cache and bind it to the name cache :

(def cache (get-idf-cache corpus))
3. Now, actually calling tf-idf-freqs on these frequencies is straightforward, as shown here:

(def freqs (map #(tf-idf-freqs cache %) corpus))
How it works…
TF-IDF scales the raw token frequencies by the number of documents they occur in within the corpus. This identifies the distinguishing words for each document. After all, if a word occurs in almost every document, it won't be a distinguishing word for any document. However, if a word is only found in one document, it helps to distinguish that document.
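The scaling above can be illustrated on toy data. This is a minimal sketch assuming the common definitions tf = raw count and idf = ln(total documents / documents containing the term); the book's own tokenize, normalize, and get-idf-cache functions may use slightly different formulas:

```clojure
;; A toy corpus of three already-tokenized "documents".
(def docs [["the" "cat" "sat"]
           ["the" "dog" "ran"]
           ["the" "cat" "ran" "fast"]])

;; Raw term frequencies per document, as in step 1.
(def freq-maps (map frequencies docs))

;; Document frequency: how many documents contain each term.
(defn doc-freq [freq-maps]
  (->> freq-maps
       (mapcat keys)
       frequencies))

;; IDF: natural log of total documents over document frequency.
(defn idf [freq-maps]
  (let [n (count freq-maps)]
    (into {}
          (map (fn [[term df]] [term (Math/log (/ n df))])
               (doc-freq freq-maps)))))

;; TF-IDF for one document's frequency map.
(defn tf-idf [idf-map freq-map]
  (into {}
        (map (fn [[term tf]] [term (* tf (idf-map term))])
             freq-map)))

;; "the" occurs in every document, so ln(3/3) = 0 and its tf-idf
;; score is 0.0 in every document; terms that occur in only one
;; document, such as "dog" and "fast", score the highest.
(map #(tf-idf (idf freq-maps) %) freq-maps)
```

Because every document contains "the", its IDF is zero and it drops out of every document's scores, exactly the behavior described above.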
For example, here are the 10 most distinguishing words from the first SOTU address:
user=> (doseq [[term idf-freq] (->> freqs
                                    first
                                    (sort-by second)
                                    reverse
                                    (take 10))]
         (println [term idf-freq ((first corpus) term)]))
[intimating 2.39029215473352 1]
[licentiousness 2.39029215473352 1]
[discern 2.185469574348983 1]
[inviolable 2.0401456408424132 1]
[specify 1.927423640693998 1]
 