Scaling document frequencies with TF-IDF
In the last few recipes, we've seen how to generate term frequencies and scale them by the
size of the document so that the frequencies from two different documents can be compared.
Term frequencies also have another problem: they don't tell you how important a term is
relative to all of the documents in the corpus.
To address this, we will use term frequency-inverse document frequency (TF-IDF).
This metric scales a term's frequency in a document by the inverse of how many documents
in the corpus contain that term, so terms that appear everywhere are discounted and
distinctive terms are emphasized.
In this recipe, we'll assemble the parts needed to implement TF-IDF.
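Before we break it into parts, the whole computation can be sketched in a few lines. This is a minimal sketch, assuming each document is already a sequence of normalized tokens; the names `tf`, `idf`, and `tf-idf` are illustrative here, not necessarily the functions we'll define in the recipe:

```clojure
(defn tf
  "Frequency of term in one tokenized document,
  scaled by the document's length."
  [term doc-tokens]
  (/ (count (filter #{term} doc-tokens))
     (double (count doc-tokens))))

(defn idf
  "Inverse document frequency: log of the total number of
  documents over the number of documents containing term."
  [term corpus]
  (Math/log (/ (count corpus)
               (double (count (filter #(some #{term} %) corpus))))))

(defn tf-idf
  "TF-IDF score of term in doc-tokens, relative to corpus."
  [term doc-tokens corpus]
  (* (tf term doc-tokens) (idf term corpus)))
```

For example, with a two-document corpus `[["a" "b"] ["a" "c"]]`, the term `"a"` scores zero everywhere (it occurs in every document, so its IDF is `log 1 = 0`), while `"b"` scores `0.5 * (log 2)` in the first document.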
Getting ready
We'll continue building on the previous recipes in this chapter. Because of that, we'll use the
same project.clj file:
(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]])
We'll also use two functions that we've created earlier in this chapter. From the Tokenizing text
recipe, we'll use tokenize. From the Focusing on content words with stoplists recipe, we'll
use normalize.
Aside from the imports required for these two functions, we'll also want to have this available
in our source code or REPL:
(require '[clojure.set :as set])
For this recipe, we'll also need more data than we've been using. We'll use a corpus
of State of the Union (SOTU) addresses from United States presidents over time. These are
yearly addresses in which the president reviews the events of the past year
and outlines priorities for the next twelve months. You can download these from
http://www.ericrochester.com/clj-data-analysis/data/sotu.tar.gz .
I've unpacked the data from this file into the sotu directory.
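Once the archive is unpacked, one way to read the whole corpus into memory is to walk that directory and run each file through the functions from the earlier recipes. This is a hedged sketch, assuming the files sit under a local sotu directory and that tokenize and normalize are in scope; the name load-corpus is illustrative:

```clojure
(require '[clojure.java.io :as io])

(defn load-corpus
  "Reads every file under dir-name and returns a sequence of
  normalized token sequences, one per document."
  [dir-name]
  (->> (file-seq (io/file dir-name))
       (filter #(.isFile %))
       (map slurp)
       (map tokenize)
       (map normalize)))

;; (def corpus (load-corpus "sotu"))
```

Holding every address in memory is fine for a corpus of this size; for larger collections you would want to process the files lazily or one at a time.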
 