Using the OpenNLP Tokenizer interface
OpenNLP provides a Tokenizer interface that is implemented by three classes: SimpleTokenizer, TokenizerME, and WhitespaceTokenizer. This interface supports two methods:
tokenize: This is passed a string to tokenize and returns an array of tokens as strings.
tokenizePos: This is passed a string and returns an array of Span objects. The Span class specifies the beginning and ending character offsets of each token within the string.
Each of these classes is demonstrated in the following sections.
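To make the relationship between tokens and offsets concrete, here is a minimal plain-Java sketch. It does not call OpenNLP: each int[] pair is a hypothetical stand-in for a Span, with an inclusive start and an exclusive end, and the splitting rule is plain whitespace splitting. The names OffsetSketch, whitespaceSpans, and tokensFrom, and the sample sentence, are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OffsetSketch {
    // Each int[] is a hypothetical stand-in for a span:
    // {start offset (inclusive), end offset (exclusive)}.
    static List<int[]> whitespaceSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = Pattern.compile("\\S+").matcher(text);
        while (m.find()) {
            spans.add(new int[] { m.start(), m.end() });
        }
        return spans;
    }

    // Recover the token strings from the offsets, which is how the
    // results of a tokenize-style and a tokenizePos-style call relate.
    static String[] tokensFrom(String text, List<int[]> spans) {
        String[] tokens = new String[spans.size()];
        for (int i = 0; i < tokens.length; i++) {
            tokens[i] = text.substring(spans.get(i)[0], spans.get(i)[1]);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String paragraph = "Let's pause, and then reflect."; // assumed sample text
        List<int[]> spans = whitespaceSpans(paragraph);
        String[] tokens = tokensFrom(paragraph, spans);
        for (int i = 0; i < tokens.length; i++) {
            System.out.println("[" + spans.get(i)[0] + ".." + spans.get(i)[1] + ") " + tokens[i]);
        }
    }
}
```

Keeping offsets rather than only strings lets a caller map each token back to its position in the original text, for example to highlight it.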
Using the SimpleTokenizer class
As the name implies, the SimpleTokenizer class performs simple tokenization of text.
The static INSTANCE field holds a ready-made instance of the class, so no constructor call is needed, as shown in the following code sequence. The tokenize method is executed against the paragraph variable and the tokens are then displayed:
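import opennlp.tools.tokenize.SimpleTokenizer;

// paragraph is a String holding the text to tokenize
SimpleTokenizer simpleTokenizer = SimpleTokenizer.INSTANCE;
String[] tokens = simpleTokenizer.tokenize(paragraph);
for (String token : tokens) {
    System.out.println(token);
}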
When executed, we get the following output:
Let
'
s
pause
,
and
then
reflect
.
Using this tokenizer, each punctuation mark, including the apostrophe in Let's, is returned as a separate token.
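One way to see why the apostrophe, comma, and period come out on their own is to split the text wherever the character class changes. The following is a simplified plain-Java stand-in written for illustration; it is not OpenNLP's source code and may differ from SimpleTokenizer on edge cases. The class name CharClassSketch and the sample sentence are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class CharClassSketch {
    private enum Kind { WHITESPACE, LETTER, DIGIT, OTHER }

    private static Kind kindOf(char c) {
        if (Character.isWhitespace(c)) return Kind.WHITESPACE;
        if (Character.isLetter(c)) return Kind.LETTER;
        if (Character.isDigit(c)) return Kind.DIGIT;
        return Kind.OTHER; // punctuation and symbols
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= text.length(); i++) {
            Kind prev = kindOf(text.charAt(i - 1));
            // A run ends at the end of the text, where the character class
            // changes, or after every single punctuation character.
            boolean boundary = i == text.length()
                    || kindOf(text.charAt(i)) != prev
                    || prev == Kind.OTHER;
            if (boundary) {
                if (prev != Kind.WHITESPACE) {
                    tokens.add(text.substring(start, i)); // whitespace runs are dropped
                }
                start = i;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Assumed sample text, chosen to match the nine tokens listed above.
        for (String token : tokenize("Let's pause, and then reflect.")) {
            System.out.println(token);
        }
    }
}
```

Under this rule, Let's becomes three tokens because the apostrophe belongs to a different character class than the letters on either side of it.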