Scaling document frequencies by document size
While raw token frequencies can be useful, they often have one major problem: comparing frequencies across different documents is complicated if the document sizes are not the same. If the word customer appears 23 times in a 500-word document and 40 times in a 1,000-word document, which one do you think is more focused on that word? It's difficult to say.
To work around this, it's common to scale the token frequencies for each document by the size of the document. That's what we'll do in this recipe.
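To make the example above concrete, dividing each raw count by its document's length answers the question directly. Here is a quick sketch using the numbers from the previous paragraph:

```clojure
;; Normalize each raw count by the length of its document:
;; 23 occurrences in 500 words vs. 40 in 1,000 words.
(def doc-a (/ 23 500))    ; => 23/500, or 0.046
(def doc-b (/ 40 1000))   ; => 1/25, or 0.04

;; The shorter document is actually more focused on the word.
(> doc-a doc-b)           ; => true
```

So even though the longer document has more raw occurrences, the shorter one devotes a larger share of its words to customer.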
Getting ready
We'll continue building on the previous recipes in this chapter. Because of that, we'll use the same project.clj file:
(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]])
We'll use the token frequencies that we figured from the Getting document frequencies recipe. We'll keep them bound to the name token-freqs.
How to do it…
The function used to perform this scaling is fairly simple. It calculates the total number of tokens by adding the values from the frequency hashmap, and then it walks over the hashmap again, scaling each frequency, as shown here:
(defn scale-by-total [freqs]
  (let [total (reduce + 0 (vals freqs))]
    (->> freqs
         (map #(vector (first %) (/ (second %) total)))
         (into {}))))
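One handy property of this scaling is that the scaled values for any document always sum to 1, which is what makes frequencies comparable across documents of different sizes. A quick check with a made-up frequency map (not the recipe's data; the definition is repeated here so the snippet stands alone):

```clojure
(defn scale-by-total [freqs]
  (let [total (reduce + 0 (vals freqs))]
    (->> freqs
         (map #(vector (first %) (/ (second %) total)))
         (into {}))))

;; A toy frequency map with 10 tokens in total.
(scale-by-total {"a" 2, "b" 3, "c" 5})
;; => {"a" 1/5, "b" 3/10, "c" 1/2}

;; The scaled frequencies always sum to 1.
(reduce + (vals (scale-by-total {"a" 2, "b" 3, "c" 5})))
;; => 1
```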
We can now use this on token-freqs from the last recipe:
user=> (pprint (scale-by-total token-freqs))
{"see" 2/19,
"purple" 1/19,
"tell" 1/19,