Using the Stanford tokenizer
Tokenization is supported by several Stanford NLP API classes; a few of them are as follows:
• The PTBTokenizer class
• The DocumentPreprocessor class
• The StanfordCoreNLP class as a pipeline
Each of these examples will use the paragraph string as defined earlier.
Using the PTBTokenizer class
This tokenizer mimics the Penn Treebank 3 (PTB) tokenizer (http://www.cis.upenn.edu/~treebank/). It differs from PTB in terms of its options and its support for Unicode. The PTBTokenizer class supports several older constructors; however, it is suggested that the three-argument constructor be used. This constructor takes a Reader object, a LexedTokenFactory<T> argument, and a string that specifies which of the several options to use.
The LexedTokenFactory interface is implemented by the CoreLabelTokenFactory and WordTokenFactory classes. The former class supports the retention of the beginning and ending character positions of a token, whereas the latter class simply returns a token as a string without any positional information. The WordTokenFactory class is used by default. We will demonstrate the use of both classes.
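To make the role of a token factory concrete, here is a minimal sketch that uses only the Java standard library. It is not the Stanford API: the TokenFactory interface, the PositionedToken class, and the whitespace-only tokenize method are hypothetical stand-ins, chosen to show how the same tokenizer can yield either plain strings or tokens that keep their begin and end character offsets, depending on the factory it is given.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenFactorySketch {

    // Hypothetical stand-in for a token factory: builds a token of type T
    // from the matched text and its character offsets (end is exclusive).
    public interface TokenFactory<T> {
        T makeToken(String text, int begin, int end);
    }

    // Hypothetical token type that retains its position in the input.
    public static final class PositionedToken {
        public final String text;
        public final int begin;
        public final int end;

        public PositionedToken(String text, int begin, int end) {
            this.text = text;
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "[" + begin + "," + end + ")";
        }
    }

    // Splits on whitespace only; a real tokenizer applies far richer rules.
    public static <T> List<T> tokenize(String input, TokenFactory<T> factory) {
        List<T> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            if (Character.isWhitespace(input.charAt(i))) {
                i++; // skip separators
                continue;
            }
            int start = i;
            while (i < input.length() && !Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            tokens.add(factory.makeToken(input.substring(start, i), start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sample = "tokens keep offsets"; // sample text assumed for illustration

        // Factory that discards positions and returns the text only.
        List<String> words = tokenize(sample, (text, begin, end) -> text);

        // Factory that retains the begin and end offsets.
        List<PositionedToken> positioned = tokenize(sample, PositionedToken::new);

        System.out.println(words);
        System.out.println(positioned);
    }
}
```

Choosing the factory at construction time keeps the splitting logic independent of the token representation, which is the same separation the three-argument constructor expresses through its LexedTokenFactory<T> argument.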
The CoreLabelTokenFactory class is used in the following example. A StringReader instance is created using paragraph. The last argument specifies the options, which is null for this example. The PTBTokenizer class implements the Iterator interface, allowing us to use the hasNext and next methods to display the tokens.
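import java.io.StringReader;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

// Parameterize the tokenizer so that next() returns CoreLabel tokens
PTBTokenizer<CoreLabel> ptb = new PTBTokenizer<>(
    new StringReader(paragraph), new CoreLabelTokenFactory(), null);
while (ptb.hasNext()) {
    System.out.println(ptb.next());
}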
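The hasNext and next loop works because the tokenizer reads from its Reader lazily, producing one token per call. The following sketch illustrates that pattern with the standard library only. ReaderTokenIterator is a hypothetical class that splits on whitespace; it is not how PTBTokenizer is implemented, only an assumption-laden illustration of an Iterator backed by a Reader.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

public class ReaderTokenIterator implements Iterator<String> {

    private final Reader reader;
    private String nextToken; // look-ahead token; null once input is exhausted

    public ReaderTokenIterator(Reader reader) {
        this.reader = reader;
        this.nextToken = readToken();
    }

    // Reads characters until one whitespace-delimited token is complete.
    private String readToken() {
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (sb.length() > 0) {
                        break; // the current token is finished
                    }
                } else {
                    sb.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.length() > 0 ? sb.toString() : null;
    }

    @Override
    public boolean hasNext() {
        return nextToken != null;
    }

    @Override
    public String next() {
        if (nextToken == null) {
            throw new NoSuchElementException();
        }
        String current = nextToken;
        nextToken = readToken(); // advance the look-ahead
        return current;
    }

    public static void main(String[] args) {
        // Sample text assumed for illustration
        Iterator<String> it =
            new ReaderTokenIterator(new StringReader("read one token at a time"));
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

Keeping a single look-ahead token lets hasNext answer without consuming input twice, and it means the whole text never needs to be held in memory as a token list.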