Now, we have all of the pieces in place to use this.
1. For this example, we'll read all of the State of the Union addresses into a sequence of raw frequency hashmaps. This will be bound to the name corpus :

(def corpus
  (->> "sotu"
       (java.io.File.)
       (.list)
       (map #(str "sotu/" %))
       (map slurp)
       (map tokenize)
       (map normalize)
       (map frequencies)))
2. We'll use these frequencies to create the IDF cache and bind it to the name cache :

(def cache (get-idf-cache corpus))
3. Now, actually calling tf-idf-freqs on these frequencies is straightforward, as shown here:

(def freqs (map #(tf-idf-freqs cache %) corpus))
How it works…
TF-IDF scales the raw token frequencies by the number of documents they occur in within the corpus. This identifies the distinguishing words for each document. After all, if a word occurs in almost every document, it won't be a distinguishing word for any document. However, if a word is only found in one document, it helps to distinguish that document.
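The scaling above can be illustrated on toy data. This is a minimal sketch assuming the common definitions tf = raw count and idf = ln(total documents / documents containing the term); the book's own tokenize, normalize, and get-idf-cache functions may use slightly different formulas:

```clojure
;; A toy corpus of three already-tokenized "documents".
(def docs [["the" "cat" "sat"]
           ["the" "dog" "ran"]
           ["the" "cat" "ran" "fast"]])

;; Raw term frequencies per document, as in step 1.
(def freq-maps (map frequencies docs))

;; Document frequency: how many documents contain each term.
(defn doc-freq [freq-maps]
  (->> freq-maps
       (mapcat keys)
       frequencies))

;; IDF: natural log of total documents over document frequency.
(defn idf [freq-maps]
  (let [n (count freq-maps)]
    (into {}
          (map (fn [[term df]] [term (Math/log (/ n df))])
               (doc-freq freq-maps)))))

;; TF-IDF for one document's frequency map.
(defn tf-idf [idf-map freq-map]
  (into {}
        (map (fn [[term tf]] [term (* tf (idf-map term))])
             freq-map)))

;; "the" occurs in every document, so ln(3/3) = 0 and its tf-idf
;; score is 0.0 in every document; terms that occur in only one
;; document, such as "dog" and "fast", score the highest.
(map #(tf-idf (idf freq-maps) %) freq-maps)
```

Because every document contains "the", its IDF is zero and it drops out of every document's scores, exactly the behavior described above.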
For example, here are the 10 most distinguishing words from the first SOTU address:
user=> (doseq [[term idf-freq] (->> freqs
                                    first
                                    (sort-by second)
                                    reverse
                                    (take 10))]
         (println [term idf-freq ((first corpus) term)]))
[intimating 2.39029215473352 1]
[licentiousness 2.39029215473352 1]
[discern 2.185469574348983 1]
[inviolable 2.0401456408424132 1]
[specify 1.927423640693998 1]
 