How to do it…
The following image shows the group of functions that we'll be coding in this recipe:
In plain English, the tf function represents the frequency of the term t in the document d,
scaled by the maximum term frequency in d. Unless you're using a stoplist, that maximum
will almost always be the frequency of the term "the".
The idf function is the log of the number of documents (N) divided by the number of
documents that contain the term t.
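For reference, the two definitions described above can be written out as follows. This is a reconstruction from the prose and from the code in this recipe, not a copy of the original figure. The symbols f(t, d) (the raw count of term t in document d) and D (the corpus) are notation assumed here. The 0.5 constants are taken from the tf implementation below.

```latex
\mathrm{tf}(t, d) = 0.5 + \frac{0.5 \, f(t, d)}{\max\{\, f(w, d) : w \in d \,\}}
\qquad
\mathrm{idf}(t, D) = \log \frac{N}{\left|\{\, d \in D : t \in d \,\}\right|}
```

The tf-idf score of a term is conventionally the product of the two values. Note that the idf code in this recipe adds one to the denominator, so it does not follow the formula exactly.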
These equations break the problem down well. We can write a function for each one of these.
We'll also create a number of other functions to help us along. Let's get started:
1. For the first function, we'll implement the tf component of the equation. This is a
direct translation of the tf function from earlier. It takes a term's frequency and
the maximum term frequency from the same document, as follows:
   (defn tf [term-freq max-freq]
     (+ 0.5 (/ (* 0.5 term-freq) max-freq)))
2. Now, we'll do the most basic implementation of idf. Like the tf function used
earlier, it's a close match to the idf equation:
   (defn idf [corpus term]
     (Math/log
       (/ (count corpus)
          (inc (count
                 (filter #(contains? % term) corpus))))))
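To see how the two functions behave, here is a minimal sketch that calls them on a toy corpus. The corpus, and the names doc and max-freq, are hypothetical and exist only for illustration. It assumes each document is a map from term to count, as produced by frequencies. That assumption matters because contains? checks keys on maps and sets but checks indices on vectors. The two definitions are repeated so the block runs on its own.

```clojure
;; Definitions repeated from the steps above so this sketch is self-contained.
(defn tf [term-freq max-freq]
  (+ 0.5 (/ (* 0.5 term-freq) max-freq)))

(defn idf [corpus term]
  (Math/log
    (/ (count corpus)
       (inc (count
              (filter #(contains? % term) corpus))))))

;; Hypothetical toy corpus: one term->count map per document.
(def corpus
  [(frequencies ["the" "cat" "sat" "on" "the" "mat"])
   (frequencies ["the" "dog" "barked"])
   (frequencies ["a" "cat" "purred"])
   (frequencies ["birds" "sing"])])

(def doc (first corpus))

;; Highest count of any term in doc; here it is the count of "the".
(def max-freq (apply max (vals doc)))

(tf (get doc "the" 0) max-freq)  ; => 1.0, the most frequent term
(tf (get doc "cat" 0) max-freq)  ; => 0.75

;; "dog" appears in 1 of 4 documents, so this is the log of 4 / (1 + 1).
(idf corpus "dog")

;; "cat" appears in 2 of 4 documents, so it scores lower than "dog".
(idf corpus "cat")
```

Because of the inc in the denominator, a term that appears in no document does not cause a division by zero. The trade-off is that a term appearing in every document gets a ratio below one, and therefore a slightly negative idf.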
 