Using the Stanford API
The Stanford NLP library supports several techniques for performing sentence detection. In this section, we will demonstrate the process using the following classes:

• PTBTokenizer
• DocumentPreprocessor
• StanfordCoreNLP

Although all of them perform SBD, each uses a different approach to the process.
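Before turning to these classes, it may help to see what rule-based sentence detection looks like with nothing but the JDK. The following standalone sketch (not part of the Stanford API; the class name and sample text are ours) uses java.text.BreakIterator, whose locale-based rules the Stanford classes effectively improve upon:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class JdkSbdExample {
    // Split text into sentences using the JDK's locale-aware break rules.
    static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            result.add(text.substring(start, end).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String s : sentences("This is a test. It has two sentences.")) {
            System.out.println(s);
        }
    }
}
```

This baseline handles simple prose, but it has no model of abbreviations or domain-specific conventions, which is where the Stanford classes described next earn their keep.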
Using the PTBTokenizer class
The PTBTokenizer class uses rules to perform SBD and has a variety of tokenization options. The constructor for this class possesses three parameters:

• A Reader class that encapsulates the text to be processed
• An object that implements the LexedTokenFactory interface
• A string holding the tokenization options

These parameters allow us to specify the text, the token factory to be used, and any options we may need for a specific text stream.
In the following code sequence, an instance of the StringReader class is created to encapsulate the text. The CoreLabelTokenFactory class is used, with the options left as null for this example:

PTBTokenizer<CoreLabel> ptb = new PTBTokenizer<>(
    new StringReader(paragraph), new CoreLabelTokenFactory(), null);
We will use the WordToSentenceProcessor class to create a List<List<CoreLabel>> that holds the sentences and their tokens. Its process method takes the tokens produced by the PTBTokenizer instance and groups them into sentences, as shown here:

WordToSentenceProcessor<CoreLabel> wtsp = new WordToSentenceProcessor<>();
List<List<CoreLabel>> sents = wtsp.process(ptb.tokenize());
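Putting the two steps together, a minimal self-contained sketch might look like the following (it assumes the Stanford CoreNLP jar is on the classpath; the class name and sample paragraph are ours):

```java
import java.io.StringReader;
import java.util.List;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.process.WordToSentenceProcessor;

public class PtbSbdExample {
    // Tokenize the paragraph, then group the tokens into sentences.
    static List<List<CoreLabel>> detectSentences(String paragraph) {
        PTBTokenizer<CoreLabel> ptb = new PTBTokenizer<>(
                new StringReader(paragraph), new CoreLabelTokenFactory(), null);
        WordToSentenceProcessor<CoreLabel> wtsp = new WordToSentenceProcessor<>();
        return wtsp.process(ptb.tokenize());
    }

    public static void main(String[] args) {
        String paragraph = "The voyage began well. Then the storm hit. We survived.";
        for (List<CoreLabel> sentence : detectSentences(paragraph)) {
            StringBuilder sb = new StringBuilder();
            for (CoreLabel token : sentence) {
                // CoreLabel.word() returns the token's surface text.
                sb.append(token.word()).append(' ');
            }
            System.out.println(sb.toString().trim());
        }
    }
}
```

Note that each inner List<CoreLabel> holds individual tokens rather than the original string, so reconstructing readable sentences requires rejoining the tokens, as the loop above does.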