Finding Parts of Text - Natural Language Processing with Java


Using the Stanford tokenizer

Several Stanford NLP API classes support tokenization, including the following:

• The PTBTokenizer class

• The DocumentPreprocessor class

• The StanfordCoreNLP class as a pipeline

Each of these examples will use the paragraph string as defined earlier.

Using the PTBTokenizer class

This tokenizer mimics the Penn Treebank 3 (PTB) tokenizer (http://www.cis.upenn.edu/~treebank/). It differs from PTB in terms of its options and its support for Unicode. The PTBTokenizer class supports several older constructors; however, it is suggested that the three-argument constructor be used. This constructor takes a Reader object, a LexedTokenFactory<T> argument, and a string that specifies which of the several options to use.

The LexedTokenFactory interface is implemented by the CoreLabelTokenFactory and WordTokenFactory classes. The former class retains the beginning and ending character positions of a token, whereas the latter class simply returns a token as a string without any positional information. The WordTokenFactory class is used by default. We will demonstrate the use of both classes.
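The distinction between the two kinds of factory can be illustrated without the Stanford library. The following is a minimal JDK-only sketch, not the Stanford implementation: the names TokenFactorySketch, TokenFactory, OffsetToken, and tokenize are hypothetical and exist only for this illustration, and the tokenizer splits on whitespace alone, which is far simpler than what PTBTokenizer does. It shows how one tokenizer can return either plain strings or position-carrying tokens depending on the factory it is given.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are not Stanford NLP classes.
class TokenFactorySketch {

    // Plays the role of a token factory: builds a token of type T
    // from the token text and its position in the input.
    interface TokenFactory<T> {
        T makeToken(String text, int begin, int length);
    }

    // A token that keeps its character offsets (end is exclusive).
    static final class OffsetToken {
        final String text;
        final int begin;
        final int end;

        OffsetToken(String text, int begin, int end) {
            this.text = text;
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "[" + begin + "," + end + ")";
        }
    }

    // Splits on whitespace only (an assumption made to keep the sketch short)
    // and delegates token construction to the supplied factory.
    static <T> List<T> tokenize(String input, TokenFactory<T> factory) {
        List<T> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            if (Character.isWhitespace(input.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < input.length() && !Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            tokens.add(factory.makeToken(input.substring(start, i), start, i - start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Let's pause, and then reflect.";

        // Factory that discards positions and returns plain strings.
        List<String> words = tokenize(text, (t, begin, length) -> t);

        // Factory that retains the begin and end offsets of each token.
        List<OffsetToken> spans =
            tokenize(text, (t, begin, length) -> new OffsetToken(t, begin, begin + length));

        System.out.println(words);
        System.out.println(spans);
    }
}
```

Keeping token construction behind a factory means the tokenizing loop does not change when the caller needs offsets, which is the same separation the LexedTokenFactory<T> argument provides.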

The CoreLabelTokenFactory class is used in the following example. A StringReader instance is created using paragraph. The last argument specifies the options and is null in this example. Because the PTBTokenizer class implements the Iterator interface, we can use the hasNext and next methods to display the tokens.

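import java.io.StringReader;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

// Tokenize paragraph; the null third argument means no options are specified
PTBTokenizer<CoreLabel> ptb = new PTBTokenizer<>(
    new StringReader(paragraph), new CoreLabelTokenFactory(), null);
while (ptb.hasNext()) {
    System.out.println(ptb.next());
}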
