Apache OpenNLP
The Apache OpenNLP project addresses common NLP tasks and will be used throughout
this topic. It consists of several components that perform specific tasks, permit models to
be trained, and support testing of those models. The general approach used by OpenNLP is
to instantiate a model that supports the task from a file and then execute methods against
the model to perform the task.
For example, in the following sequence, we will tokenize a simple string. For this code to
execute properly, it must handle the FileNotFoundException and IOException exceptions. We
use a try-with-resources block to open a FileInputStream instance on the en-token.bin file.
This file contains a model that has been trained using English text:
try (InputStream is = new FileInputStream(
        new File(getModelDir(), "en-token.bin"))) {
    // Insert code to tokenize the text
} catch (FileNotFoundException ex) {
    …
} catch (IOException ex) {
    …
}
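The exception-handling shape above can be tried without OpenNLP. The following is a minimal sketch using only the standard library. The class ModelStreamDemo, the methods describe and tempFileOfSize, and the file names are hypothetical and merely stand in for the real model file. It illustrates that the stream is closed automatically when the try block exits and that FileNotFoundException, being a subclass of IOException, has to be caught first.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;

public class ModelStreamDemo {

    // Hypothetical helper: reports what happened when reading a model-like file.
    public static String describe(File file) {
        // try-with-resources closes the stream when the block exits,
        // whether normally or because of an exception.
        try (InputStream is = new FileInputStream(file)) {
            int count = 0;
            while (is.read() != -1) {
                count++;
            }
            return "read " + count + " bytes";
        } catch (FileNotFoundException ex) {
            // Listed first because it is a subclass of IOException.
            return "not found";
        } catch (IOException ex) {
            return "I/O error";
        }
    }

    // Hypothetical helper: creates a throwaway file holding the given number of bytes.
    public static File tempFileOfSize(int size) {
        try {
            File f = File.createTempFile("demo-model", ".bin");
            f.deleteOnExit();
            Files.write(f.toPath(), new byte[size]);
            return f;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        File present = tempFileOfSize(3);
        File missing = new File(present.getParentFile(), "no-such-model-file.bin");

        System.out.println(describe(present)); // read 3 bytes
        System.out.println(describe(missing)); // not found
    }
}
```

Swapping the two catch clauses would not compile, because the IOException handler would already cover FileNotFoundException.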
An instance of the TokenizerModel class is then created from the en-token.bin input stream
inside the try block. Next, we create a Tokenizer by instantiating TokenizerME, a class that
implements the Tokenizer interface, as shown here:
TokenizerModel model = new TokenizerModel(is);
Tokenizer tokenizer = new TokenizerME(model);
The tokenize method is then called with the text to be tokenized as its argument. The
method returns an array of String objects:
String[] tokens = tokenizer.tokenize("He lives at 1511 W. " +
        "Randolph.");
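For contrast, a naive whitespace split of a sentence like the one above can be sketched with the standard library alone. This is not part of OpenNLP and is not the approach used in this section. The class WhitespaceSplitDemo and its split method are hypothetical names chosen for illustration. The sketch shows the kind of case a trained model is meant to handle better: splitting on whitespace leaves the final period attached to the last word.

```java
import java.util.Arrays;

public class WhitespaceSplitDemo {

    // Naive baseline (not OpenNLP): split on runs of whitespace only.
    public static String[] split(String text) {
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String[] parts = split("He lives at 1511 W. Randolph.");
        // Prints: [He, lives, at, 1511, W., Randolph.]
        System.out.println(Arrays.toString(parts));
    }
}
```

The rest of this section continues with the tokens array returned by tokenizer.tokenize, not with this baseline.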
A for-each statement displays the tokens as shown here. The open and close brackets are
used to clearly identify the tokens: