2. The stoplist will actually be represented by a Clojure set. This will make filtering a lot
easier. The load-stopwords function will read in the file, break it into lines, and
fold them into a set, as follows:
(defn load-stopwords [filename]
  (with-open [r (io/reader filename)]
    (set (doall (line-seq r)))))

(def is-stopword (load-stopwords "stopwords/english"))
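Because a Clojure set is itself a function that looks up its argument, the set returned by load-stopwords can be called directly as the predicate that the next step passes to remove. Here is a minimal sketch of that behavior using a small hand-built stoplist (the words in demo-stopwords are illustrative, not the contents of the stopwords/english file):

```clojure
;; A Clojure set acts as a membership predicate: calling it with an
;; element returns that element when present, or nil when absent.
(def demo-stopwords #{"a" "the" "i" "to"})

(demo-stopwords "the")
;;=> "the" (truthy)

(demo-stopwords "cow")
;;=> nil (falsey)

;; This is why the set can be handed straight to remove.
(remove demo-stopwords ["i" "never" "saw" "a" "cow"])
;;=> ("never" "saw" "cow")
```

The same lookup behavior is what lets is-stopword be used without wrapping it in contains?.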
3. Finally, we can load the tokens. This will break the input into sentences. Then, it will
tokenize each sentence, normalize its tokens, and remove its stopwords, as follows:
(def tokens
  (map #(remove is-stopword (normalize (tokenize %)))
       (get-sentences
         "I never saw a Purple Cow.
          I never hope to see one.
          But I can tell you, anyhow.
          I'd rather see than be one.")))
Now, you can see that the tokens returned are more focused on the content and are missing
all of the function words:
user=> (pprint tokens)
(("never" "saw" "purple" "cow" ".")
("never" "hope" "see" "one" ".")
("tell" "," "anyhow" ".")
("'d" "rather" "see" "one" "."))
Getting document frequencies
One common and useful metric to work with text corpora is to get the counts of the tokens in
the documents. This can be done quite easily by leveraging standard Clojure functions.
Let's see how.
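As a sketch of the idea before the recipe proper, the standard frequencies function already turns one document's token sequence into a count map, and merge-with + combines those maps across a corpus. The token data below is illustrative, echoing the output of the previous recipe:

```clojure
;; Per-document token sequences, like those produced above.
(def doc-tokens
  [["never" "saw" "purple" "cow"]
   ["never" "hope" "see" "one"]])

;; frequencies maps each distinct token to its count in one document.
(frequencies (first doc-tokens))
;;=> {"never" 1, "saw" 1, "purple" 1, "cow" 1}

;; One frequency map per document.
(def doc-freqs (map frequencies doc-tokens))

;; Merging the maps with + gives corpus-wide token counts.
(apply merge-with + doc-freqs)
;;=> {"never" 2, "saw" 1, "purple" 1, "cow" 1,
;;    "hope" 1, "see" 1, "one" 1}
```

Everything here is in clojure.core, which is what makes this metric so cheap to compute.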
Getting ready
We'll continue building on the previous recipes in this chapter. Because of that, we'll use the
same project.clj ile:
(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]])
 