Using the OpenNLPTokenizer class
OpenNLP possesses a Tokenizer interface that is implemented by three classes: SimpleTokenizer, TokenizerME, and WhitespaceTokenizer. This interface supports two methods:
• tokenize: This is passed a string to tokenize and returns an array of tokens as strings.
• tokenizePos: This is passed a string and returns an array of Span objects. The Span class is used to specify the beginning and ending offsets of the tokens.
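To make the relationship between the two methods concrete, here is a minimal plain-Java sketch; it is not OpenNLP code. The OffsetSketch class, its nested Span class, and the whitespace-only splitting rule are all assumptions made for illustration. The sketch only mirrors the idea that a span records where a token begins and ends, and that the string tokens can be recovered from those offsets:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a stand-in for the tokenize/tokenizePos idea, not OpenNLP's classes.
class OffsetSketch {

    // Hypothetical span type: start is inclusive, end is exclusive.
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Returns the offsets of each run of non-whitespace characters.
    static List<Span> tokenizePos(String text) {
        List<Span> spans = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i < text.length(); i++) {
            boolean ws = Character.isWhitespace(text.charAt(i));
            if (!ws && start < 0) {
                start = i;                     // a token begins here
            } else if (ws && start >= 0) {
                spans.add(new Span(start, i)); // the token ended just before i
                start = -1;
            }
        }
        if (start >= 0) {
            spans.add(new Span(start, text.length())); // token runs to the end
        }
        return spans;
    }

    // Derives the string tokens from the spans.
    static String[] tokenize(String text) {
        List<Span> spans = tokenizePos(text);
        String[] tokens = new String[spans.size()];
        for (int i = 0; i < tokens.length; i++) {
            Span s = spans.get(i);
            tokens[i] = text.substring(s.start, s.end);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Let's pause, and then reflect.";
        for (Span s : tokenizePos(text)) {
            System.out.println("[" + s.start + ".." + s.end + ") "
                    + text.substring(s.start, s.end));
        }
    }
}
```

Keeping offsets rather than copied strings lets later processing steps refer back to exact positions in the original text.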
Each of these classes is demonstrated in the following sections.
Using the SimpleTokenizer class
As the name implies, the SimpleTokenizer class performs simple tokenization of text. The INSTANCE field provides a ready-made instance of the class, as shown in the following code sequence. The tokenize method is executed against the paragraph variable and the tokens are then displayed:
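import opennlp.tools.tokenize.SimpleTokenizer;

// paragraph is a String holding the text to tokenize, defined earlier
SimpleTokenizer simpleTokenizer = SimpleTokenizer.INSTANCE;
String[] tokens = simpleTokenizer.tokenize(paragraph);
for (String token : tokens) {
    System.out.println(token);
}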
When executed, we get the following output:
Let
'
s
pause
,
and
then
reflect
.
Using this tokenizer, punctuation is returned as separate tokens.
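One way to picture how output like this can arise is to split the text wherever the kind of character changes. The following is a plain-Java sketch of that idea, not OpenNLP's source code. The CharClassSketch class, the Kind enum, and the rule that every punctuation character becomes its own token are assumptions chosen so that the sketch reproduces the tokens listed above:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: splits on character-class changes; not OpenNLP's implementation.
class CharClassSketch {

    enum Kind { WHITESPACE, LETTER, DIGIT, OTHER }

    // Classifies a single character into one of four coarse kinds.
    static Kind kindOf(char c) {
        if (Character.isWhitespace(c)) return Kind.WHITESPACE;
        if (Character.isLetter(c)) return Kind.LETTER;
        if (Character.isDigit(c)) return Kind.DIGIT;
        return Kind.OTHER;
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= text.length(); i++) {
            Kind prev = kindOf(text.charAt(i - 1));
            // A token ends at the end of the text, when the character kind
            // changes, or after any punctuation character (assumption:
            // each punctuation character is a token on its own).
            boolean boundary = i == text.length()
                    || kindOf(text.charAt(i)) != prev
                    || prev == Kind.OTHER;
            if (boundary) {
                if (prev != Kind.WHITESPACE) {
                    tokens.add(text.substring(start, i)); // whitespace runs are dropped
                }
                start = i;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("Let's pause, and then reflect.")) {
            System.out.println(token);
        }
    }
}
```

With this rule the apostrophe in "Let's" separates "Let" from "s", which matches the three tokens shown at the start of the output.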