Scaling document frequencies by document size
While raw token frequencies can be useful, they often have one major problem: comparing frequencies across different documents is complicated if the document sizes are not the same. If the word customer appears 23 times in a 500-word document and 40 times in a 1,000-word document, which one do you think is more focused on that word? It's difficult to say.
To work around this, it's common to scale the token frequencies for each document by the size of the document. That's what we'll do in this recipe.
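To make the example above concrete, dividing each raw count by its document's length answers the question directly. Here is a quick sketch using the numbers from the previous paragraph:

```clojure
;; Normalize each raw count by the length of its document:
;; 23 occurrences in 500 words vs. 40 in 1,000 words.
(def doc-a (/ 23 500))    ; => 23/500, or 0.046
(def doc-b (/ 40 1000))   ; => 1/25, or 0.04

;; The shorter document is actually more focused on the word.
(> doc-a doc-b)           ; => true
```

So even though the longer document has more raw occurrences, the shorter one devotes a larger share of its words to customer.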
Getting ready
We'll continue building on the previous recipes in this chapter. Because of that, we'll use the same project.clj file:
(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]])
We'll use the token frequencies that we figured from the Getting document frequencies recipe. We'll keep them bound to the name token-freqs.
How to do it…
The function used to perform this scaling is fairly simple. It calculates the total number of tokens by adding the values from the frequency hashmap, and then it walks over the hashmap again, scaling each frequency, as shown here:
(defn scale-by-total [freqs]
  (let [total (reduce + 0 (vals freqs))]
    (->> freqs
         (map #(vector (first %) (/ (second %) total)))
         (into {}))))
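One handy property of this scaling is that the scaled values for any document always sum to 1, which is what makes frequencies comparable across documents of different sizes. A quick check with a made-up frequency map (not the recipe's data; the definition is repeated here so the snippet stands alone):

```clojure
(defn scale-by-total [freqs]
  (let [total (reduce + 0 (vals freqs))]
    (->> freqs
         (map #(vector (first %) (/ (second %) total)))
         (into {}))))

;; A toy frequency map with 10 tokens in total.
(scale-by-total {"a" 2, "b" 3, "c" 5})
;; => {"a" 1/5, "b" 3/10, "c" 1/2}

;; The scaled frequencies always sum to 1.
(reduce + (vals (scale-by-total {"a" 2, "b" 3, "c" 5})))
;; => 1
```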
We can now use this on token-freqs from the last recipe:
user=> (pprint (scale-by-total token-freqs))
{"see" 2/19,
"purple" 1/19,
"tell" 1/19,