On the other hand, chair has a meaning beyond what it's doing in the sentence, and in fact,
its role in the sentence will vary (subject, direct object, and so on).
You don't always want to filter out stopwords, since doing so throws away information. However,
since function words are more frequent than content words, sometimes focusing on the content
words can add clarity to your analysis and its output. Removing stopwords can also speed up
processing, because there are fewer tokens to handle.
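To make the trade-off concrete, here is a minimal sketch of stopword filtering in plain
Clojure. The names example-stopwords and example-tokens, and the tiny word list, are made up
for illustration; the recipe itself loads a full stopword list from a file.

```clojure
;; A tiny, hypothetical stopword set; a real list is much longer.
(def example-stopwords #{"the" "is" "in" "a" "of"})

;; Already-lowercased tokens for a short sentence.
(def example-tokens ["the" "chair" "is" "in" "the" "corner"])

;; A set can be called as a predicate, so remove drops every token
;; that is a member of the stopword set.
(remove example-stopwords example-tokens)
;; => ("chair" "corner")

;; Counting the raw tokens shows how function words dominate.
(frequencies example-tokens)
;; => {"the" 2, "chair" 1, "is" 1, "in" 1, "corner" 1}
```

Four of the six tokens disappear, which is both the benefit (less noise, less work) and the
cost (less information) described above.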
Getting ready
This recipe will build on the work that we've done so far in this chapter. As such, it will use the
same project.clj file that we used in the Tokenizing text and Finding sentences recipes:
(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]])
However, we'll use a slightly different set of requirements for this recipe:
(require '[opennlp.nlp :as nlp]
         '[clojure.java.io :as io])
We'll also need to have a list of stopwords. You can easily create your own list, but for the
purpose of this recipe, we'll use the English stopword list included with the Natural Language
com/nltk_data/packages/corpora/stopwords.zip
.
Unzip it into your project directory and make sure that the stopwords/english file exists.
We'll also use the tokenize and get-sentences functions that we created in the previous
two recipes.
How to do it…
We'll need to create a function in order to process and normalize the tokens. Also, we'll need
a utility function to load the stopword list. Once these are in place, we'll see how to use the
stopwords. To do this, perform the following steps:
1. The words in the stopword list have been lowercased. We can also do this with the
tokens that we create. We'll use the normalize function to handle the lowercasing
of each token:
(defn normalize [token-seq]
  (map #(.toLowerCase %) token-seq))
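As a quick check of what this step produces, the following sketch lowercases a handful of
hand-written tokens. It uses clojure.string/lower-case instead of Java interop, which is an
equivalent alternative rather than the recipe's own code, and the name normalize-tokens is
hypothetical, chosen so that it doesn't shadow normalize.

```clojure
(require '[clojure.string :as str])

;; Illustrative variant of normalize: lowercases each token lazily.
(defn normalize-tokens [token-seq]
  (map str/lower-case token-seq))

(normalize-tokens ["The" "Chair" "IS" "red"])
;; => ("the" "chair" "is" "red")
```

Because map is lazy, no token is actually lowercased until the resulting sequence is consumed.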