3.
Now, we'll take a short detour in order to optimize prematurely. The IDF value of a
term is the same across the whole corpus, but if we're not careful, we'll write code
that recomputes it for every document. For example, the IDF value for "the" will be
the same, no matter how many times "the" actually occurs in the current document.
We can precompute these values and cache them. However, before we can do that,
we'll need to obtain the set of all the terms in the corpus. The get-corpus-terms
function does this, as shown here:
;; Assumes clojure.set has been required with the alias set.
(defn get-corpus-terms [corpus]
  (->> corpus
       (map #(set (keys %)))
       (reduce set/union #{})))
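As a quick check of what this returns, here is a minimal sketch run on a made-up corpus. The `toy-corpus` name and its contents are hypothetical. The only assumption carried over from the recipe is that each document is a map from terms to raw frequencies.

```clojure
(require '[clojure.set :as set])

(defn get-corpus-terms [corpus]
  (->> corpus
       (map #(set (keys %)))       ; one set of terms per document
       (reduce set/union #{})))    ; merge them into a single set

;; Hypothetical toy corpus: one term->frequency map per document.
(def toy-corpus
  [{"the" 3, "cat" 1}
   {"the" 2, "dog" 1}
   {"fish" 4}])

(get-corpus-terms toy-corpus)
;; a set containing "the", "cat", "dog", and "fish"
```

Each term appears once in the result, however many documents contain it.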
4.
The get-idf-cache function takes a corpus, extracts its term set, and returns a
hashmap associating the terms with their IDF values, as follows:
(defn get-idf-cache [corpus]
  (reduce #(assoc %1 %2 (idf corpus %2)) {}
          (get-corpus-terms corpus)))
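To see the cache in action, the sketch below builds one for a small made-up corpus. The `idf` defined here is only a stand-in, because the recipe defines its own in an earlier step. It assumes the log of the document count divided by one plus the number of documents containing the term. The `toy-corpus` data is also hypothetical.

```clojure
(require '[clojure.set :as set])

;; Stand-in for the recipe's `idf` (an assumption, not the original definition):
;; log of (number of documents / (1 + documents containing the term)).
(defn idf [corpus term]
  (let [doc-count (count (filter #(contains? % term) corpus))]
    (Math/log (/ (double (count corpus)) (inc doc-count)))))

(defn get-corpus-terms [corpus]
  (->> corpus
       (map #(set (keys %)))
       (reduce set/union #{})))

(defn get-idf-cache [corpus]
  (reduce #(assoc %1 %2 (idf corpus %2)) {}
          (get-corpus-terms corpus)))

;; Hypothetical toy corpus of term->frequency maps.
(def toy-corpus
  [{"the" 3, "cat" 1}
   {"the" 2, "dog" 1}
   {"fish" 4}])

;; Each term's IDF is computed exactly once here.
(def idf-cache (get-idf-cache toy-corpus))

(idf-cache "the")
;; "the" occurs in two of three documents, so the stand-in gives log(3/3) = 0.0
```

Because the cache is an ordinary map, it can be called as a function to look up a term, which is how the later steps use it.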
5.
Now, the tf-idf function is our lowest-level function that combines tf and idf.
It just takes the raw parameters, including the cached IDF value, and performs the
necessary calculations:
(defn tf-idf [idf-value freq max-freq]
  (* (tf freq max-freq) idf-value))
6.
The tf-idf-pair function sits immediately on top of tf-idf. It gets the IDF value
from the cache, and for one of its parameters, it takes a pair of a term and its raw
frequency. It returns the pair with the frequency replaced by the TF-IDF for that term:
(defn tf-idf-pair [idf-cache max-freq pair]
  (let [[term freq] pair]
    [term (tf-idf (idf-cache term) freq max-freq)]))
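The two functions can be tried in isolation with a hand-written cache, as in this minimal sketch. The `tf` here is a stand-in, because the recipe defines its own in an earlier step. It assumes augmented frequency, 0.5 + 0.5 * freq / max-freq. The `toy-cache` values are invented for illustration.

```clojure
;; Stand-in for the recipe's `tf` (an assumption): augmented frequency.
(defn tf [freq max-freq]
  (+ 0.5 (/ (* 0.5 freq) max-freq)))

(defn tf-idf [idf-value freq max-freq]
  (* (tf freq max-freq) idf-value))

(defn tf-idf-pair [idf-cache max-freq pair]
  (let [[term freq] pair]                ; destructure the [term freq] pair
    [term (tf-idf (idf-cache term) freq max-freq)]))

;; Hypothetical hand-written cache; real values would come from get-idf-cache.
(def toy-cache {"cat" 2.0, "the" 0.0})

(tf-idf-pair toy-cache 4 ["cat" 4])
;; tf is 0.5 + 0.5 * 4/4 = 1.0, so the pair becomes ["cat" 2.0]
```

Note that a term missing from the cache makes `(idf-cache term)` return nil, and the multiplication then throws. This is one reason the cache is built from every term in the corpus.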
7.
Finally, the tf-idf-freqs function controls the entire process. It takes an IDF
cache and a frequency hashmap, and it scales the frequencies in the hashmap into
their TF-IDF equivalents, as follows:
(defn tf-idf-freqs [idf-cache freqs]
  (let [max-freq (reduce max 0 (vals freqs))]
    (->> freqs
         (map #(tf-idf-pair idf-cache max-freq %))
         (into {}))))
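Putting the steps together, the following sketch scales every document in a small made-up corpus. The `tf` and `idf` definitions are stand-ins for the ones the recipe defines earlier, assuming augmented frequency and a smoothed log ratio respectively. The `toy-corpus` data is hypothetical.

```clojure
(require '[clojure.set :as set])

;; Stand-ins for the recipe's `tf` and `idf` (assumptions, not the originals).
(defn tf [freq max-freq]
  (+ 0.5 (/ (* 0.5 freq) max-freq)))

(defn idf [corpus term]
  (let [doc-count (count (filter #(contains? % term) corpus))]
    (Math/log (/ (double (count corpus)) (inc doc-count)))))

;; The functions from steps 3 to 7.
(defn get-corpus-terms [corpus]
  (->> corpus (map #(set (keys %))) (reduce set/union #{})))

(defn get-idf-cache [corpus]
  (reduce #(assoc %1 %2 (idf corpus %2)) {} (get-corpus-terms corpus)))

(defn tf-idf [idf-value freq max-freq]
  (* (tf freq max-freq) idf-value))

(defn tf-idf-pair [idf-cache max-freq pair]
  (let [[term freq] pair]
    [term (tf-idf (idf-cache term) freq max-freq)]))

(defn tf-idf-freqs [idf-cache freqs]
  (let [max-freq (reduce max 0 (vals freqs))]
    (->> freqs
         (map #(tf-idf-pair idf-cache max-freq %))
         (into {}))))

;; Hypothetical toy corpus of term->frequency maps.
(def toy-corpus
  [{"the" 3, "cat" 1}
   {"the" 2, "dog" 1}
   {"fish" 4}])

;; Build the cache once, then reuse it for every document.
(def idf-cache (get-idf-cache toy-corpus))
(def scaled (map #(tf-idf-freqs idf-cache %) toy-corpus))

(first scaled)
;; a map with the same keys as the first document; under the stand-in idf,
;; "the" scores 0.0 because it appears in two of the three documents
```

The cache is computed a single time outside the `map`, so the per-document work is limited to a lookup and a multiplication for each term.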
 