3. Now, we'll take a short detour in order to optimize prematurely. In this case, the IDF
value for each term is the same across the whole corpus, but if we're not careful, we'll
end up recomputing it for every document. For example, the IDF value for "the" will be
the same, no matter how many times "the" actually occurs in the current document. We can
precompute these values and cache them. However, before we can do that, we'll need to
obtain the set of all the terms in the corpus. The get-corpus-terms function does this,
as shown here:
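;; Collect every distinct term in the corpus.
;; Requires clojure.set to be loaded with the alias set.
(defn get-corpus-terms [corpus]
  (->> corpus
       (map #(set (keys %)))
       (reduce set/union #{})))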
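To see what this returns, here is a minimal sketch run against a made-up corpus. It assumes, as the function itself does, that each document is a hashmap from a term to its raw frequency; the toy-corpus name and its contents are hypothetical and only serve as an illustration.

```clojure
(require '[clojure.set :as set])

;; Same definition as in the step above, repeated so that the snippet runs on its own.
(defn get-corpus-terms [corpus]
  (->> corpus
       (map #(set (keys %)))
       (reduce set/union #{})))

;; Hypothetical toy corpus: one term -> raw-frequency map per document.
(def toy-corpus
  [{"the" 3, "cat" 1}
   {"the" 2, "dog" 4}])

;; "the" occurs in both documents but appears only once in the result.
(get-corpus-terms toy-corpus)
;; => a set containing "the", "cat", and "dog" (print order may vary)
```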
4. The get-idf-cache function takes a corpus, extracts its term set, and returns a
hashmap associating the terms with their IDF values, as follows:
;; Map each term in the corpus to its IDF value, computed once per term.
(defn get-idf-cache [corpus]
  (reduce #(assoc %1 %2 (idf corpus %2)) {}
          (get-corpus-terms corpus)))
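The following sketch builds a cache for a small, made-up corpus. The idf function belongs to an earlier step that is not repeated here, so the version below is only a stand-in and an assumption on our part: the log of the document count divided by one plus the number of documents containing the term. The toy-corpus data is likewise hypothetical.

```clojure
(require '[clojure.set :as set])

;; ASSUMPTION: a stand-in for the idf function from an earlier step.
;; log(N / (1 + number of documents containing the term)).
(defn idf [corpus term]
  (Math/log
    (double (/ (count corpus)
               (inc (count (filter #(contains? % term) corpus)))))))

(defn get-corpus-terms [corpus]
  (->> corpus
       (map #(set (keys %)))
       (reduce set/union #{})))

(defn get-idf-cache [corpus]
  (reduce #(assoc %1 %2 (idf corpus %2)) {}
          (get-corpus-terms corpus)))

;; Hypothetical toy corpus: one term -> raw-frequency map per document.
(def toy-corpus
  [{"the" 3, "cat" 1}
   {"the" 2, "dog" 4}
   {"the" 1, "cat" 2, "fish" 1}])

(def idf-cache (get-idf-cache toy-corpus))

;; The cache holds one entry per distinct term, however often the term occurs.
(count idf-cache)

;; Under the stand-in idf, a term found in every document scores lower
;; than a term found in only one.
(< (idf-cache "the") (idf-cache "fish"))
```

Because the cache is an ordinary hashmap, it can be called like a function to look up a term, which is how it is used in the steps that follow.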
5. Now, the tf-idf function is our lowest-level function that combines tf and idf.
It just takes the raw parameters, including the cached IDF value, and performs the
necessary calculations:
;; Scale the term frequency by the precomputed IDF value.
(defn tf-idf [idf-value freq max-freq]
  (* (tf freq max-freq) idf-value))
6. The tf-idf-pair function sits immediately on top of tf-idf. It gets the IDF value
from the cache, and one of its parameters is a pair containing a term and its raw
frequency. It returns the pair with the raw frequency replaced by the TF-IDF score for
that term:
;; Replace the raw frequency in a [term freq] pair with its TF-IDF score.
(defn tf-idf-pair [idf-cache max-freq pair]
  (let [[term freq] pair]
    [term (tf-idf (idf-cache term) freq max-freq)]))
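As a quick check of how the two lowest layers fit together, this sketch calls tf-idf-pair with a hand-built cache. The tf function comes from an earlier step that is not shown here, so the version below is an assumed stand-in (0.5 plus half the frequency divided by the maximum frequency), and the cache values are invented purely to keep the arithmetic easy to follow.

```clojure
;; ASSUMPTION: a stand-in for the tf function from an earlier step.
(defn tf [freq max-freq]
  (+ 0.5 (/ (* 0.5 freq) max-freq)))

(defn tf-idf [idf-value freq max-freq]
  (* (tf freq max-freq) idf-value))

(defn tf-idf-pair [idf-cache max-freq pair]
  (let [[term freq] pair]
    [term (tf-idf (idf-cache term) freq max-freq)]))

;; Hypothetical cache with made-up IDF values.
(def idf-cache {"cat" 2.0, "the" 0.5})

;; tf = 0.5 + (0.5 * 4) / 4 = 1.0, and 1.0 * 2.0 = 2.0
(tf-idf-pair idf-cache 4 ["cat" 4])
;; => ["cat" 2.0]

;; tf = 0.5 + (0.5 * 2) / 4 = 0.75, and 0.75 * 0.5 = 0.375
(tf-idf-pair idf-cache 4 ["the" 2])
;; => ["the" 0.375]
```

Taking the pair as the last parameter means that the first two arguments can be fixed while the function is mapped over the entries of a frequency hashmap.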
7. Finally, the tf-idf-freqs function controls the entire process. It takes an IDF
cache and a frequency hashmap, and it scales the frequencies in the hashmap into
their TF-IDF equivalents, as follows:
;; Convert one document's frequency hashmap into a hashmap of TF-IDF scores.
(defn tf-idf-freqs [idf-cache freqs]
  (let [max-freq (reduce max 0 (vals freqs))]
    (->> freqs
         (map #(tf-idf-pair idf-cache max-freq %))
         (into {}))))
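Putting the pieces together, the sketch below builds the cache once and then reuses it for every document, which is the point of the detour. The tf and idf definitions are assumed stand-ins for the functions from earlier steps, and toy-corpus is hypothetical data, so treat this as an illustration of the call pattern rather than as reference output.

```clojure
(require '[clojure.set :as set])

;; ASSUMPTION: stand-ins for the tf and idf functions from earlier steps.
(defn tf [freq max-freq]
  (+ 0.5 (/ (* 0.5 freq) max-freq)))

(defn idf [corpus term]
  (Math/log
    (double (/ (count corpus)
               (inc (count (filter #(contains? % term) corpus)))))))

(defn get-corpus-terms [corpus]
  (->> corpus
       (map #(set (keys %)))
       (reduce set/union #{})))

(defn get-idf-cache [corpus]
  (reduce #(assoc %1 %2 (idf corpus %2)) {}
          (get-corpus-terms corpus)))

(defn tf-idf [idf-value freq max-freq]
  (* (tf freq max-freq) idf-value))

(defn tf-idf-pair [idf-cache max-freq pair]
  (let [[term freq] pair]
    [term (tf-idf (idf-cache term) freq max-freq)]))

(defn tf-idf-freqs [idf-cache freqs]
  (let [max-freq (reduce max 0 (vals freqs))]
    (->> freqs
         (map #(tf-idf-pair idf-cache max-freq %))
         (into {}))))

;; Hypothetical toy corpus: one term -> raw-frequency map per document.
(def toy-corpus
  [{"the" 3, "cat" 1}
   {"the" 2, "dog" 4}
   {"the" 1, "cat" 2, "fish" 1}])

;; Compute the IDF values once for the whole corpus...
(def idf-cache (get-idf-cache toy-corpus))

;; ...and reuse them for each document. mapv forces the result eagerly.
(def scaled-corpus
  (mapv #(tf-idf-freqs idf-cache %) toy-corpus))

;; Each document keeps its terms; only the values change.
(keys (first scaled-corpus))
```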