Introduction to NLP - Natural Language Processing with Java

Apache OpenNLP

The Apache OpenNLP project addresses common NLP tasks and will be used throughout
this topic. It consists of several components that perform specific tasks, permit models to
be trained, and support testing of those models. The general approach used by OpenNLP is
to instantiate a model that supports the task from a file and then execute methods against
the model to perform the task.

For example, in the following sequence, we will tokenize a simple string. For this code to
execute properly, it must handle the FileNotFoundException and IOException
exceptions. We use a try-with-resources block to open a FileInputStream instance on
the en-token.bin file. This file contains a model that has been trained on English text:

try (InputStream is = new FileInputStream(
        new File(getModelDir(), "en-token.bin"))) {
    // Insert code to tokenize the text
} catch (FileNotFoundException ex) {
    …
} catch (IOException ex) {
    …
}
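To make the exception handling concrete, here is a minimal, self-contained sketch that uses only the standard library. The class name ModelFileDemo, the probe helper, and the no-such-dir directory are hypothetical names chosen for illustration; they are not part of OpenNLP. Because FileNotFoundException is a subclass of IOException, its catch clause has to come first:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

class ModelFileDemo {
    // Hypothetical helper: reports whether a model file can be opened and read.
    static String probe(File modelFile) {
        // try-with-resources closes the stream automatically when the block exits.
        try (InputStream is = new FileInputStream(modelFile)) {
            int firstByte = is.read();
            return firstByte == -1 ? "empty" : "readable";
        } catch (FileNotFoundException ex) {
            // Thrown by the FileInputStream constructor when the file cannot be opened.
            return "not found";
        } catch (IOException ex) {
            // Covers failures while reading or closing the stream.
            return "read error";
        }
    }

    public static void main(String[] args) {
        // "no-such-dir" is an assumed, nonexistent directory used to trigger the first catch.
        System.out.println(probe(new File("no-such-dir", "en-token.bin")));
    }
}
```

In real code the catch blocks would typically log or rethrow the exception rather than return a status string; the string return is used here only to keep the sketch easy to check.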

Inside the try block, an instance of the TokenizerModel class is created from the input
stream. Next, we create a Tokenizer instance, here a TokenizerME, as shown:

TokenizerModel model = new TokenizerModel(is);
Tokenizer tokenizer = new TokenizerME(model);

The tokenize method is then called with the text to be tokenized as its argument. The
method returns an array of String objects:

String[] tokens = tokenizer.tokenize("He lives at 1511 W. " +
        "Randolph.");

A for-each statement displays the tokens as shown here. The open and close brackets are
used to clearly identify the tokens:
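A minimal sketch of such a loop follows. It uses a hand-written array as a stand-in for the array returned by tokenize, so the array contents, the class name TokenDisplayDemo, and the bracket helper are illustrative assumptions and not actual OpenNLP output:

```java
class TokenDisplayDemo {
    // Hypothetical helper: wraps each token in brackets so token boundaries are visible.
    static String bracket(String[] tokens) {
        StringBuilder sb = new StringBuilder();
        for (String token : tokens) {
            sb.append("[").append(token).append("] ");
        }
        // Drop the trailing space left by the last iteration.
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Assumed sample data; in the surrounding example this would be
        // the String array returned by tokenizer.tokenize(...).
        String[] tokens = {"He", "lives", "at"};
        System.out.println(bracket(tokens));
    }
}
```

Bracketing each token makes it easy to see where the tokenizer split the text, including cases where punctuation ends up as a token of its own.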
