Scaling document frequencies with TF-IDF
In the last few recipes, we've seen how to generate term frequencies and scale them by the
size of the document so that the frequencies from two different documents can be compared.
Term frequencies also have another problem: they don't tell you how important a term is
relative to all of the documents in the corpus.
To address this, we will use term frequency-inverse document frequency (TF-IDF).
This metric scales a term's frequency in a document by the inverse of how many documents
in the corpus contain that term, so terms that appear everywhere are discounted and
distinctive terms are emphasized.
In this recipe, we'll assemble the parts needed to implement TF-IDF.
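Before we break it into parts, the whole computation can be sketched in a few lines. This is a minimal sketch, assuming each document is already a sequence of normalized tokens; the names `tf`, `idf`, and `tf-idf` are illustrative here, not necessarily the functions we'll define in the recipe:

```clojure
(defn tf
  "Frequency of term in one tokenized document,
  scaled by the document's length."
  [term doc-tokens]
  (/ (count (filter #{term} doc-tokens))
     (double (count doc-tokens))))

(defn idf
  "Inverse document frequency: log of the total number of
  documents over the number of documents containing term."
  [term corpus]
  (Math/log (/ (count corpus)
               (double (count (filter #(some #{term} %) corpus))))))

(defn tf-idf
  "TF-IDF score of term in doc-tokens, relative to corpus."
  [term doc-tokens corpus]
  (* (tf term doc-tokens) (idf term corpus)))
```

For example, with a two-document corpus `[["a" "b"] ["a" "c"]]`, the term `"a"` scores zero everywhere (it occurs in every document, so its IDF is `log 1 = 0`), while `"b"` scores `0.5 * (log 2)` in the first document.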
Getting ready
We'll continue building on the previous recipes in this chapter. Because of that, we'll use the
same project.clj file:
(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]])
We'll also use two functions that we've created earlier in this chapter. From the Tokenizing text
recipe, we'll use tokenize. From the Focusing on content words with stoplists recipe, we'll
use normalize.
Aside from the imports required for these two functions, we'll also want to have this available
in our source code or REPL:
(require '[clojure.set :as set])
For this recipe, we'll also need more data than we've been using. We'll use a corpus
of State of the Union (SOTU) addresses from United States presidents over time. These are
yearly addresses in which the president reviews the events of the past year
and outlines priorities for the next twelve months. You can download these from
http://www.ericrochester.com/clj-data-analysis/data/sotu.tar.gz .
I've unpacked the data from this file into the sotu directory.
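Once the archive is unpacked, one way to read the whole corpus into memory is to walk that directory and run each file through the functions from the earlier recipes. This is a hedged sketch, assuming the files sit under a local sotu directory and that tokenize and normalize are in scope; the name load-corpus is illustrative:

```clojure
(require '[clojure.java.io :as io])

(defn load-corpus
  "Reads every file under dir-name and returns a sequence of
  normalized token sequences, one per document."
  [dir-name]
  (->> (file-seq (io/file dir-name))
       (filter #(.isFile %))
       (map slurp)
       (map tokenize)
       (map normalize)))

;; (def corpus (load-corpus "sotu"))
```

Holding every address in memory is fine for a corpus of this size; for larger collections you would want to process the files lazily or one at a time.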
 