On the other hand, chair has a meaning beyond what it's doing in the sentence, and in fact,
its role in the sentence will vary (subject, direct object, and so on).
You don't always want to filter out stopwords, since doing so throws away information. However,
since function words are more frequent than content words, focusing on the content words can
sometimes add clarity to your analysis and its output. Removing stopwords also leaves fewer
tokens to process, which can speed things up.
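To make the idea concrete, here is a minimal sketch of stopword filtering. The sample-stopwords set and the remove-stopwords function are hypothetical names used only for illustration; the set is a tiny hand-picked list, not the NLTK list used later in this recipe.

```clojure
;; Illustration only: a tiny hand-picked stopword set (an assumption,
;; not the NLTK English list).
(def sample-stopwords #{"the" "is" "on" "a" "of"})

(defn remove-stopwords
  "Hypothetical helper: drops every token that appears in the stopword set.
  A Clojure set acts as a predicate, returning the token if it is a member
  and nil otherwise."
  [stopwords tokens]
  (remove stopwords tokens))

(remove-stopwords sample-stopwords ["the" "chair" "is" "on" "the" "porch"])
;; => ("chair" "porch")
```

Only the content words survive, which is the effect described above: less noise from function words and fewer tokens to process.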
Getting ready
This recipe will build on the work that we've done so far in this chapter. As such, it will use the
same project.clj file that we used in the Tokenizing text and Finding sentences recipes:
(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]])
However, we'll use a slightly different set of requirements for this recipe:
(require '[opennlp.nlp :as nlp]
         '[clojure.java.io :as io])
We'll also need to have a list of stopwords. You can easily create your own list, but for the
purpose of this recipe, we'll use the English stopword list included with the Natural Language
Toolkit (http://www.nltk.org/). You can download this from
http://nltk.github.com/nltk_data/packages/corpora/stopwords.zip. Unzip it into your project
directory and make sure that the stopwords/english file exists.
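One way to confirm from the REPL that the file landed in the right place is a quick existence check. This is only a sketch: stopword-file-present? is a hypothetical helper name, and the relative path assumes that the REPL was started from the project directory.

```clojure
(require '[clojure.java.io :as io])

(defn stopword-file-present?
  "Hypothetical helper: returns true if a file exists at the given path.
  Relative paths are resolved against the JVM's working directory."
  [path]
  (.exists (io/file path)))

;; Assumes the archive was unzipped into the project directory.
(stopword-file-present? "stopwords/english")
```

If this returns false, check where the archive was unzipped and which directory the REPL was started from.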
We'll also use the tokenize and get-sentences functions that we created in the previous
two recipes.
How to do it…
We'll need to create a function in order to process and normalize the tokens. Also, we'll need
a utility function to load the stopword list. Once these are in place, we'll see how to use the
stopwords. To do this, perform the following steps:
1. The words in the stopword list have been lowercased. We can also do this with the
   tokens that we create. We'll use the normalize function to handle the lowercasing
   of each token:
   (defn normalize [token-seq]
     (map #(.toLowerCase %) token-seq))
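As a quick check of what normalize does, the following sketch repeats the definition so that it can be pasted into a fresh REPL, and then applies it to a hand-written token vector. The sample tokens are made up for illustration. The normalize-alt variant is a hypothetical alternative, not part of the recipe, that uses clojure.string/lower-case instead of Java interop.

```clojure
(require '[clojure.string :as string])

;; Same definition as in the step above, repeated so this block stands alone.
(defn normalize [token-seq]
  (map #(.toLowerCase %) token-seq))

;; Hypothetical alternative that avoids the interop call.
(defn normalize-alt [token-seq]
  (map string/lower-case token-seq))

(normalize ["The" "Chair" "IS" "here"])
;; => ("the" "chair" "is" "here")

(= (normalize ["The" "Chair"]) (normalize-alt ["The" "Chair"]))
;; => true
```

Both versions return a lazy sequence, so nothing is lowercased until the result is consumed.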