Scaling document frequencies by document size
While raw token frequencies can be useful, they often have one major problem: comparing
frequencies across documents is complicated when the documents are not the same size.
If the word customer appears 23 times in a 500-word document and 40 times in a
1,000-word document, which one is more focused on that word? It's difficult
to say.
To work around this, it's common to scale the token frequencies for each document by the
size of the document. That's what we'll do in this recipe.
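To see why scaling helps, divide each count by its document's length. This is plain arithmetic, shown here at the REPL:

user=> (double (/ 23 500))
0.046
user=> (double (/ 40 1000))
0.04

Scaled this way, the 500-word document (0.046) turns out to be slightly more focused on customer than the 1,000-word one (0.04).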
Getting ready
We'll continue building on the previous recipes in this chapter. Because of that, we'll use the
same project.clj file:
(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]])
We'll use the token frequencies that we figured from the Getting document frequencies recipe.
We'll keep them bound to the name token-freqs.
How to do it…
The function used to perform this scaling is fairly simple. It calculates the total number
of tokens by adding the values in the frequency hashmap, and then it walks over the
hashmap again, scaling each frequency, as shown here:
(defn scale-by-total [freqs]
  ;; Sum all of the token counts, then divide each count by that total.
  (let [total (reduce + 0 (vals freqs))]
    (->> freqs
         (map #(vector (first %) (/ (second %) total)))
         (into {}))))
We can now use this on token-freqs from the last recipe:
user=> (pprint (scale-by-total token-freqs))
{"see" 2/19,
 "purple" 1/19,
 "tell" 1/19,
 …}
 