We will also need to require it into the current namespace:
(require '[opennlp.nlp :as nlp])
Finally, we'll download a model for a statistical sentence splitter. I downloaded en-sent.bin
from http://opennlp.sourceforge.net/models-1.5/. I then saved it into models/en-sent.bin.
How to do it…
As in the Tokenizing text recipe, we will start by loading the sentence identification model
data, as shown here:
(def get-sentences
  (nlp/make-sentence-detector "models/en-sent.bin"))
Now, we use that data to split a text into a series of sentences, as follows:
user=> (get-sentences "I never saw a Purple Cow.
I never hope to see one.
But I can tell you, anyhow.
I'd rather see than be one.")
["I never saw a Purple Cow."
"I never hope to see one."
"But I can tell you, anyhow."
"I'd rather see than be one."]
How it works…
The data model in models/en-sent.bin contains the information that OpenNLP
needs to recreate a previously trained sentence identification algorithm. Once we have
reinstantiated this algorithm, we can use it to identify the sentences in a text, as we did
by calling get-sentences.
Focusing on content words with stoplists
A stoplist (or stopword list) is a list of words that should be excluded from further analysis,
usually because they are so common that they add little information to it. These lists are
typically dominated by what are known as function words—words that serve a grammatical
purpose in the sentence but carry little meaning on their own. For example, the signals that
a noun follows, but it does not have much meaning by itself. Other words, such as the
preposition after, do have a meaning, but they are so common that they tend to get in
the way.
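Filtering against a stoplist is easy once the list is loaded into a set, since Clojure sets act as predicates. The following is a minimal sketch; the stopwords here are a tiny hypothetical sample, not a complete stoplist:

```clojure
(require '[clojure.string :as str])

;; A small, illustrative stoplist. A real one would be much longer,
;; typically loaded from a file.
(def stopwords
  #{"the" "a" "an" "of" "to" "in" "and" "but" "after"})

(defn remove-stopwords
  "Keeps only the tokens that are not in the stoplist,
  comparing case-insensitively."
  [tokens]
  (remove (comp stopwords str/lower-case) tokens))

(remove-stopwords ["I" "never" "saw" "a" "Purple" "Cow"])
;; => ("I" "never" "saw" "Purple" "Cow")
```

Because the set itself is the predicate, membership tests stay fast even for large stoplists, and lowercasing the token before the lookup lets the list store each word only once.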