Using the Stanford tokenizer
Tokenization is supported by several Stanford NLP API classes; a few of them are as
follows:
• The PTBTokenizer class
• The DocumentPreprocessor class
• The StanfordCoreNLP class as a pipeline
Each of these examples will use the paragraph string as defined earlier.
Using the PTBTokenizer class
This tokenizer mimics the Penn Treebank 3 (PTB) tokenizer
(http://www.cis.upenn.edu/~treebank/). It differs from PTB in terms of its options and
its support for Unicode. The PTBTokenizer class supports several older constructors;
however, it is suggested that the three-argument constructor be used. This constructor
takes a Reader object, a LexedTokenFactory<T> argument, and a string that specifies
which of the several options to use.
The LexedTokenFactory interface is implemented by the CoreLabelTokenFactory and
WordTokenFactory classes. The former class supports the retention of the beginning and
ending character positions of a token, whereas the latter class simply returns a
token as a string without any positional information. The WordTokenFactory class is
used by default. We will demonstrate the use of both classes.
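To make the difference between the two kinds of factory concrete, here is a minimal sketch that uses only the standard library. It is not the Stanford implementation: the TokenFactory interface, the PositionedToken class, and the WhitespaceTokenizer class are hypothetical names invented for illustration, and splitting on whitespace is a deliberate simplification of what a PTB-style tokenizer does. The sketch shows how one tokenizer can return plain strings or position-aware tokens depending only on the factory it is given.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Illustrative only: every type here is hypothetical, not part of the Stanford NLP API. */
public class TokenFactorySketch {

    /** Hypothetical stand-in for a token factory: builds a token of type T. */
    interface TokenFactory<T> {
        T makeToken(String text, int begin, int end);
    }

    /** A token that remembers where it was found in the input. */
    static final class PositionedToken {
        final String text;
        final int begin; // index of the first character (inclusive)
        final int end;   // index just past the last character (exclusive)

        PositionedToken(String text, int begin, int end) {
            this.text = text;
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "[" + begin + "," + end + ")";
        }
    }

    /** Simplified whitespace tokenizer exposed through the Iterator interface. */
    static final class WhitespaceTokenizer<T> implements Iterator<T> {
        private final String input;
        private final TokenFactory<T> factory;
        private int pos = 0;

        WhitespaceTokenizer(String input, TokenFactory<T> factory) {
            this.input = input;
            this.factory = factory;
            skipWhitespace();
        }

        private void skipWhitespace() {
            while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
                pos++;
            }
        }

        @Override
        public boolean hasNext() {
            return pos < input.length();
        }

        @Override
        public T next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            int begin = pos;
            while (pos < input.length() && !Character.isWhitespace(input.charAt(pos))) {
                pos++;
            }
            // The factory decides how much of this information the token keeps.
            T token = factory.makeToken(input.substring(begin, pos), begin, pos);
            skipWhitespace();
            return token;
        }
    }

    /** Tokens as bare strings: the positions are discarded by the factory. */
    static List<String> plainTokens(String input) {
        List<String> out = new ArrayList<>();
        Iterator<String> it =
            new WhitespaceTokenizer<String>(input, (text, begin, end) -> text);
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    /** Tokens that retain their beginning and ending character positions. */
    static List<PositionedToken> positionedTokens(String input) {
        List<PositionedToken> out = new ArrayList<>();
        Iterator<PositionedToken> it =
            new WhitespaceTokenizer<PositionedToken>(input, PositionedToken::new);
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        String sample = "Let's pause, and then reflect.";
        System.out.println(plainTokens(sample));
        System.out.println(positionedTokens(sample));
    }
}
```

Because the tokenizer is generic in the token type, the calling code chooses the level of detail it needs by passing a different factory, which is the same division of labor the LexedTokenFactory<T> argument suggests.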
The CoreLabelTokenFactory class is used in the following example. A
StringReader instance is created using paragraph. The last argument is used for the
options, which is null for this example. The PTBTokenizer class implements the
Iterator interface, allowing us to use the hasNext and next methods to display the
tokens.
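import java.io.StringReader;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

// The third argument is the options string; null is passed in this example
PTBTokenizer<CoreLabel> ptb = new PTBTokenizer<>(
    new StringReader(paragraph), new CoreLabelTokenFactory(), null);
while (ptb.hasNext()) {
    System.out.println(ptb.next());
}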