We will also need to require it into the current namespace:
(require '[opennlp.nlp :as nlp])
Finally, we'll download a model for a statistical sentence splitter. I downloaded en-sent.bin and saved it in the models directory.
How to do it…
As in the Tokenizing text recipe, we will start by loading the sentence identification model data, as shown here:
(def get-sentences
(nlp/make-sentence-detector "models/en-sent.bin"))
Now, we use that data to split a text into a series of sentences, as follows:
user=> (get-sentences "I never saw a Purple Cow.
I never hope to see one.
But I can tell you, anyhow.
I'd rather see than be one.")
["I never saw a Purple Cow."
"I never hope to see one."
"But I can tell you, anyhow."
"I'd rather see than be one."]
How it works…
The data model in models/en-sent.bin contains the information that OpenNLP needs to recreate a previously trained sentence identification algorithm. Once we have reinstantiated this algorithm, we can use it to identify the sentences in a text, as we did by calling get-sentences.
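Because make-sentence-detector returns an ordinary function, the reinstantiated detector can be reused across many documents. As a minimal sketch (assuming the model file is at models/en-sent.bin, as above; the sentence-counts helper is hypothetical, not part of clojure-opennlp), we could count the sentences in a collection of texts:

```
(require '[opennlp.nlp :as nlp])

;; Reinstantiate the pre-trained sentence detector once, then reuse it.
(def get-sentences
  (nlp/make-sentence-detector "models/en-sent.bin"))

;; Hypothetical helper: count the sentences in each of several documents.
(defn sentence-counts [documents]
  (map (comp count get-sentences) documents))

(sentence-counts ["I never saw a Purple Cow. I never hope to see one."
                  "But I can tell you, anyhow."])
;; => (2 1)
```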
Focusing on content words with stoplists
Stoplists (or stopwords) are lists of words that should be excluded from further analysis, usually because they're so common that they don't add much information to it.
These lists are typically dominated by what are known as function words: words that serve a grammatical purpose in the sentence but carry little meaning of their own. For example, the signals that a noun phrase follows, but it has little meaning by itself. Other words, such as the preposition after, do have a meaning, but they are so common that they tend to get in the way.
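In Clojure, a stoplist is naturally represented as a set, since sets double as membership predicates. Here is a minimal sketch of filtering tokens against one (the stopword set below is illustrative, not a standard list):

```
(require '[clojure.string :as string])

;; An illustrative (not exhaustive) stoplist as a Clojure set.
(def stopwords
  #{"the" "a" "an" "of" "to" "and" "after" "i" "but"})

;; Keep only the content words from a sequence of tokens,
;; comparing case-insensitively.
(defn remove-stopwords [tokens]
  (remove (comp stopwords string/lower-case) tokens))

(remove-stopwords ["I" "never" "saw" "a" "Purple" "Cow"])
;; => ("never" "saw" "Purple" "Cow")
```

Because the set returns nil for words it doesn't contain, it can be passed directly to remove without wrapping it in an explicit predicate.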