How to do it…
The following image shows the group of functions that we'll be coding in this recipe:
So, in English, the function for tf represents the frequency of the term t in the document d, scaled by the maximum term frequency in d. In other words, unless you're using a stoplist, that maximum will almost always be the frequency of the term "the".
The function for idf is the log of the number of documents (N) divided by the number of documents that contain the term t.
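Written out from the two descriptions above and from the code in the steps that follow, the functions can be sketched roughly as below. The notation is an assumption made for illustration: f(t, d) stands for the raw count of term t in document d, and D for the corpus of N documents. The 0.5 constants are taken from the tf code in step 1. Note that the idf code in step 2 adds one to the denominator, which this form does not show.

```latex
\begin{align*}
\mathrm{tf}(t, d)  &= 0.5 + \frac{0.5 \cdot f(t, d)}{\max\{\, f(w, d) : w \in d \,\}} \\
\mathrm{idf}(t, D) &= \log \frac{N}{\left|\{\, d \in D : t \in d \,\}\right|}
\end{align*}
```

For example, a term that occurs 3 times in a document whose most frequent term occurs 12 times gets tf = 0.5 + (0.5 × 3) / 12 = 0.625.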
These equations break the problem down well. We can write a function for each one of these.
We'll also create a number of other functions to help us along. Let's get started:
1. For the first function, we'll implement the tf component of the equation. This is a transparent translation of the tf function from earlier. It takes a term's frequency and the maximum term frequency from the same document, as follows:
(defn tf [term-freq max-freq]
  (+ 0.5 (/ (* 0.5 term-freq) max-freq)))
2. Now, we'll do the most basic implementation of idf. Like the tf function used earlier, it's a close match to the idf equation:
(defn idf [corpus term]
  (Math/log
    (/ (count corpus)
       ;; inc keeps the denominator non-zero when no document contains term
       (inc (count
              (filter #(contains? % term) corpus))))))
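To see how tf and idf behave together, here is a minimal sketch that applies them to a tiny corpus. It assumes each document is represented as a map from term to raw count (so that contains? tests whether the term is a key). The sample corpus and the helper tf-idf-doc are hypothetical names introduced only for this illustration and are not part of the recipe.

```clojure
;; tf and idf repeated from steps 1 and 2 so the example is self-contained.
(defn tf [term-freq max-freq]
  (+ 0.5 (/ (* 0.5 term-freq) max-freq)))

(defn idf [corpus term]
  (Math/log
    (/ (count corpus)
       (inc (count
              (filter #(contains? % term) corpus))))))

;; Assumption: a document is a map of term -> raw count.
;; `frequencies` builds such a map from a sequence of tokens.
(def corpus
  (map frequencies
       [["the" "cat" "saw" "the" "dog"]
        ["the" "dog" "barked"]
        ["a" "bird" "sang"]
        ["a" "fish" "swam"]]))

;; Hypothetical helper: scores every term in one document by
;; multiplying its tf by its idf over the whole corpus.
(defn tf-idf-doc [corpus doc-freqs]
  (let [max-freq (apply max (vals doc-freqs))]
    (into {}
          (for [[term n] doc-freqs]
            [term (* (tf n max-freq) (idf corpus term))]))))

;; Score the first document.
(tf-idf-doc corpus (first corpus))
```

In the first document, "the" has tf 1.0 and appears in two of the four documents, so its score is log(4/3), about 0.288. "cat" has tf 0.75 and appears in one document, so its score is 0.75 × log(4/2), about 0.520. One consequence of the inc in idf is that a term found in every document gets a slightly negative value, because the denominator becomes N + 1.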