Finding Parts of Text - Natural Language Processing with Java


Using the OpenNLP Tokenizer interface

OpenNLP provides a Tokenizer interface that is implemented by three classes: SimpleTokenizer, TokenizerME, and WhitespaceTokenizer. This interface supports two methods:

• tokenize: This is passed a string to tokenize and returns an array of tokens as strings.

• tokenizePos: This is passed a string and returns an array of Span objects. The Span class is used to specify the beginning and ending offsets of the tokens.
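The relationship between the two methods can be sketched without OpenNLP itself. The following is a minimal, self-contained illustration, not the library's implementation: the class name TokenOffsetsSketch is hypothetical, plain int pairs stand in for Span objects, tokens are split on whitespace only, and the end offset is assumed to be exclusive.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenOffsetsSketch {

    // Stand-in for tokenizePos: returns {start, end} pairs for each
    // whitespace-delimited token. The end offset is exclusive (an
    // assumption of this sketch).
    public static List<int[]> tokenizePos(String text) {
        List<int[]> spans = new ArrayList<>();
        int start = -1; // start of the current token, or -1 if none is open
        for (int i = 0; i <= text.length(); i++) {
            boolean boundary = i == text.length()
                    || Character.isWhitespace(text.charAt(i));
            if (boundary) {
                if (start >= 0) {
                    spans.add(new int[] {start, i});
                    start = -1;
                }
            } else if (start < 0) {
                start = i;
            }
        }
        return spans;
    }

    // Stand-in for tokenize: derives the token strings from the offsets.
    public static String[] tokenize(String text) {
        List<int[]> spans = tokenizePos(text);
        String[] tokens = new String[spans.size()];
        for (int i = 0; i < tokens.length; i++) {
            int[] span = spans.get(i);
            tokens[i] = text.substring(span[0], span[1]);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Let's pause, and then reflect.";
        String[] tokens = tokenize(text);
        List<int[]> spans = tokenizePos(text);
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + " [" + spans.get(i)[0]
                    + ", " + spans.get(i)[1] + ")");
        }
    }
}
```

The point of the sketch is that each token string can be recovered from the original text using its offsets, which is why a method returning offsets is useful when token positions need to be mapped back to the source string.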

Each of these classes is demonstrated in the following sections.

Using the SimpleTokenizer class

As the name implies, the SimpleTokenizer class performs simple tokenization of text. The static INSTANCE field provides a shared instance of the class, as shown in the following code sequence. The tokenize method is called with the paragraph variable, and the resulting tokens are then displayed:

// Requires: import opennlp.tools.tokenize.SimpleTokenizer;
SimpleTokenizer simpleTokenizer = SimpleTokenizer.INSTANCE;
String[] tokens = simpleTokenizer.tokenize(paragraph);
// Print each token on its own line
for (String token : tokens) {
    System.out.println(token);
}

When executed, we get the following output:

Let
'
s
pause
,
and
then
reflect
.

Using this tokenizer, punctuation is returned as separate tokens.
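One way to see why punctuation ends up in separate tokens is to split the text wherever the class of character changes. The following is a rough sketch of that idea, not OpenNLP's actual code: the class name CharClassTokenizerSketch, the CharClass enum, and the sample sentence (chosen to be consistent with the output above) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class CharClassTokenizerSketch {

    // Hypothetical character classes used only by this sketch.
    private enum CharClass { WHITESPACE, LETTER, DIGIT, OTHER }

    private static CharClass classify(char c) {
        if (Character.isWhitespace(c)) return CharClass.WHITESPACE;
        if (Character.isLetter(c)) return CharClass.LETTER;
        if (Character.isDigit(c)) return CharClass.DIGIT;
        return CharClass.OTHER;
    }

    // Starts a new token whenever the character class changes;
    // whitespace separates tokens and is never emitted.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1; // start of the current token, or -1 if none is open
        CharClass current = CharClass.WHITESPACE;
        for (int i = 0; i < text.length(); i++) {
            CharClass cls = classify(text.charAt(i));
            if (cls != current) {
                if (start >= 0) {
                    tokens.add(text.substring(start, i));
                }
                start = (cls == CharClass.WHITESPACE) ? -1 : i;
                current = cls;
            }
        }
        if (start >= 0) {
            tokens.add(text.substring(start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Sample sentence assumed for illustration.
        for (String token : tokenize("Let's pause, and then reflect.")) {
            System.out.println(token);
        }
    }
}
```

Because the apostrophe, comma, and period fall into a different class from the letters around them, each one closes the preceding word and becomes a token of its own.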
