Working with Unstructured and Textual Data - Clojure Data Analysis

3.

Now, we'll take a short detour to optimize prematurely. The IDF value of a term is the same for every document in the corpus, but if we're not careful, we'll write code that recomputes it for each document. For example, the IDF value for "the" will be the same no matter how many times "the" occurs in the current document. We can precompute these values and cache them. Before we can do that, however, we need the set of all the terms in the corpus. The get-corpus-terms function does this, as shown here:

;; Assumes clojure.set is already required under the alias set,
;; for example with (require '[clojure.set :as set]).
(defn get-corpus-terms [corpus]
  (->> corpus
       (map #(set (keys %)))
       (reduce set/union #{})))
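To make the shape of the data concrete, here is a minimal sketch of get-corpus-terms applied to a tiny corpus. It assumes, as the step above implies, that each document is a hashmap of terms to raw counts. The sample-corpus name and its contents are made up for illustration.

```clojure
(require '[clojure.set :as set])

(defn get-corpus-terms [corpus]
  (->> corpus
       (map #(set (keys %)))     ; one set of terms per document
       (reduce set/union #{})))  ; merge them into a single set

;; Hypothetical corpus: three documents, each a map of term -> raw count.
(def sample-corpus
  [{"the" 3, "cat" 1}
   {"the" 2, "dog" 4}
   {"fish" 1}])

;; Returns a set containing "the", "cat", "dog", and "fish".
(get-corpus-terms sample-corpus)
```

Because the result is a set, "the" appears only once even though two documents contain it.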

4.

The get-idf-cache function takes a corpus, extracts its term set, and returns a hashmap associating the terms with their IDF values, as follows:

(defn get-idf-cache [corpus]
  (reduce #(assoc %1 %2 (idf corpus %2)) {}
          (get-corpus-terms corpus)))
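The sketch below builds a cache for a small corpus. The idf function is defined outside this section, so the version here is only a stand-in that we assume for illustration: the log of the document count divided by one plus the number of documents containing the term. The sample-corpus data is also hypothetical.

```clojure
(require '[clojure.set :as set])

;; ASSUMPTION: stand-in for the idf function defined elsewhere.
;; The real definition may differ.
(defn idf [corpus term]
  (let [doc-count (count (filter #(contains? % term) corpus))]
    (Math/log (double (/ (count corpus) (inc doc-count))))))

(defn get-corpus-terms [corpus]
  (->> corpus
       (map #(set (keys %)))
       (reduce set/union #{})))

(defn get-idf-cache [corpus]
  (reduce #(assoc %1 %2 (idf corpus %2)) {}
          (get-corpus-terms corpus)))

;; Hypothetical corpus: each document is a map of term -> raw count.
(def sample-corpus
  [{"the" 3, "cat" 1}
   {"the" 2, "dog" 4}
   {"fish" 1}])

;; idf is called once per distinct term, not once per document.
(def idf-cache (get-idf-cache sample-corpus))

;; "the" occurs in more documents than "cat", so its IDF is lower.
(< (idf-cache "the") (idf-cache "cat"))
```

Since the cache is an ordinary hashmap, it can be called as a function of a term, which is how tf-idf-pair looks up values later.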

5.

Now, the tf-idf function is our lowest-level function that combines tf and idf. It just takes the raw parameters, including the cached IDF value, and performs the necessary calculations:

(defn tf-idf [idf-value freq max-freq]
  (* (tf freq max-freq) idf-value))
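A short worked example may help here. The tf function is defined outside this section, so the one below is an assumed stand-in: an augmented term frequency that scales the raw count into the range 0.5 to 1.0. The input numbers are arbitrary.

```clojure
;; ASSUMPTION: stand-in for the tf function defined elsewhere.
;; It maps a raw count to 0.5 + 0.5 * (freq / max-freq).
(defn tf [freq max-freq]
  (+ 0.5 (/ (* 0.5 freq) max-freq)))

(defn tf-idf [idf-value freq max-freq]
  (* (tf freq max-freq) idf-value))

;; The most frequent term in a document (freq = max-freq) has tf = 1.0,
;; so its score is just the IDF value.
(tf-idf 2.0 3 3)
;; => 2.0

;; With this stand-in, a count of 0 gives tf = 0.5, so the score is half the IDF value.
(tf-idf 2.0 0 3)
;; => 1.0
```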

6.

The tf-idf-pair function sits immediately on top of tf-idf. It takes the IDF cache, the maximum frequency in the document, and a pair consisting of a term and its raw frequency. It looks up the term's IDF value in the cache and returns the pair with the raw frequency replaced by the TF-IDF score for that term:

(defn tf-idf-pair [idf-cache max-freq pair]
  (let [[term freq] pair]
    [term (tf-idf (idf-cache term) freq max-freq)]))

7.

Finally, the tf-idf-freqs function controls the entire process. It takes an IDF cache and a frequency hashmap, and it scales the frequencies in the hashmap into their TF-IDF equivalents, as follows:

(defn tf-idf-freqs [idf-cache freqs]
  (let [max-freq (reduce max 0 (vals freqs))]
    (->> freqs
         (map #(tf-idf-pair idf-cache max-freq %))
         (into {}))))
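Putting the steps together, the following sketch runs the whole pipeline over a small corpus. The tf and idf definitions are assumed stand-ins for the functions defined outside this section, and sample-corpus is hypothetical data. The remaining functions are the ones from the steps above.

```clojure
(require '[clojure.set :as set])

;; ASSUMPTION: stand-ins for tf and idf, which are defined elsewhere.
(defn tf [freq max-freq]
  (+ 0.5 (/ (* 0.5 freq) max-freq)))

(defn idf [corpus term]
  (let [doc-count (count (filter #(contains? % term) corpus))]
    (Math/log (double (/ (count corpus) (inc doc-count))))))

(defn get-corpus-terms [corpus]
  (->> corpus
       (map #(set (keys %)))
       (reduce set/union #{})))

(defn get-idf-cache [corpus]
  (reduce #(assoc %1 %2 (idf corpus %2)) {}
          (get-corpus-terms corpus)))

(defn tf-idf [idf-value freq max-freq]
  (* (tf freq max-freq) idf-value))

(defn tf-idf-pair [idf-cache max-freq pair]
  (let [[term freq] pair]
    [term (tf-idf (idf-cache term) freq max-freq)]))

(defn tf-idf-freqs [idf-cache freqs]
  (let [max-freq (reduce max 0 (vals freqs))]
    (->> freqs
         (map #(tf-idf-pair idf-cache max-freq %))
         (into {}))))

;; Hypothetical corpus: each document is a map of term -> raw count.
(def sample-corpus
  [{"the" 3, "cat" 1}
   {"the" 2, "dog" 4}
   {"fish" 1}])

;; Build the cache once, then reuse it for every document.
(def idf-cache (get-idf-cache sample-corpus))

(def scaled-corpus
  (mapv #(tf-idf-freqs idf-cache %) sample-corpus))
```

Each element of scaled-corpus has the same keys as the corresponding input document, with the raw counts replaced by TF-IDF scores. Under the assumed idf, a common term such as "the" scores lower than a rarer term such as "cat" in the first document, even though its raw count is higher.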
