Apache OpenNLP
The Apache OpenNLP project addresses common NLP tasks and will be used throughout
this topic. It consists of several components that perform specific tasks, permit models to
be trained, and support testing of those models. The general approach used by OpenNLP is
to instantiate a model that supports the task from a file and then execute methods against
the model to perform the task.
For example, the following sequence tokenizes a simple string. For this code to
compile, it must handle the checked exceptions FileNotFoundException and IOException.
We use a try-with-resources block to open a FileInputStream instance on the
en-token.bin file. This file contains a model that has been trained on English
text:
try (InputStream is = new FileInputStream(
        new File(getModelDir(), "en-token.bin"))) {
    // Insert code to tokenize the text
} catch (FileNotFoundException ex) {
    // Handle a missing model file
} catch (IOException ex) {
    // Handle other I/O errors
}
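The order of the two catch clauses matters, because FileNotFoundException is a subclass of IOException and must therefore be caught first. The following is a minimal sketch of that pattern using only the standard library. The class ModelFileCheck, the method probe, and the directory name no-such-dir are hypothetical names chosen for illustration; no OpenNLP model is loaded.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

final class ModelFileCheck {

    // Hypothetical helper: tries to open a model file and reports what happened.
    static String probe(File modelFile) {
        try (InputStream is = new FileInputStream(modelFile)) {
            // The stream is open here; a real program would hand it to the model constructor.
            return "opened";
        } catch (FileNotFoundException ex) {
            // Caught first because it is a subclass of IOException.
            return "missing";
        } catch (IOException ex) {
            // Any other I/O failure, for example an error while closing the stream.
            return "io-error";
        }
    }

    public static void main(String[] args) {
        // Assumes no directory named no-such-dir exists in the working directory.
        System.out.println(probe(new File("no-such-dir", "en-token.bin")));
    }
}
```

Returning a status string keeps the sketch easy to check; a real application would more likely log the exception or rethrow it.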
An instance of the TokenizerModel class is then created from this stream inside the try
block. Next, we create an instance of the TokenizerME class and assign it to a variable of
the Tokenizer interface type, as shown here:
TokenizerModel model = new TokenizerModel(is);
Tokenizer tokenizer = new TokenizerME(model);
The tokenize method is then called with the text to be tokenized as its argument. The
method returns an array of String objects:
String[] tokens = tokenizer.tokenize("He lives at 1511 W. " +
        "Randolph.");
A for-each statement can then display the tokens. Each token is enclosed in square
brackets so that its boundaries are clearly visible.
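The bracketed display described above can be sketched as follows. This is only an illustration under stated assumptions: the class TokenDisplay and the method bracket are hypothetical names, and the tokens array is a hand-written placeholder rather than the actual output of the en-token.bin model.

```java
final class TokenDisplay {

    // Wraps each token in square brackets and joins the results with single spaces.
    static String bracket(String[] tokens) {
        StringBuilder sb = new StringBuilder();
        for (String token : tokens) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append('[').append(token).append(']');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Placeholder tokens for illustration only; not produced by a tokenizer model.
        String[] tokens = {"He", "lives", "at"};
        System.out.println(bracket(tokens)); // prints "[He] [lives] [at]"
    }
}
```

In the tokenizing code, the array returned by tokenizer.tokenize would take the place of the placeholder array.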