Using the Stanford API
The Stanford NLP library supports several techniques used to perform sentence detection.
In this section, we will demonstrate the process using the following classes:
PTBTokenizer
DocumentPreprocessor
StanfordCoreNLP
Although all of them perform SBD, each uses a different approach to the process.
Using the PTBTokenizer class
The PTBTokenizer class uses rules to perform SBD and offers a variety of tokenization
options. The constructor for this class takes three parameters:
• A Reader class that encapsulates the text to be processed
• An object that implements the LexedTokenFactory interface
• A string holding the tokenization options
These parameters allow us to specify the text, the token factory to be used, and any
tokenization options we may need for a specific text stream.
In the following code sequence, an instance of the StringReader class is created to
encapsulate the text. The CoreLabelTokenFactory class is used, with the options left
as null for this example:
PTBTokenizer<CoreLabel> ptb = new PTBTokenizer<>(
    new StringReader(paragraph), new CoreLabelTokenFactory(), null);
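When specific behavior is needed, the third argument accepts a single comma-separated
string of options instead of null. As a minimal sketch, the following variant passes two
options documented for PTBTokenizer: invertible, which records enough information to
reconstruct the original text of each token, and tokenizeNLs, which returns newlines as
tokens. The particular options chosen here are illustrative only:

// Illustrative options string; see the PTBTokenizer documentation for the full list
PTBTokenizer<CoreLabel> ptbWithOptions = new PTBTokenizer<>(
    new StringReader(paragraph), new CoreLabelTokenFactory(),
    "invertible,tokenizeNLs");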
We will use the WordToSentenceProcessor class to create a List<List<CoreLabel>>
instance to hold the sentences and their tokens. Its process method takes the tokens
produced by the PTBTokenizer instance and groups them into sentences, as shown here:
WordToSentenceProcessor<CoreLabel> wtsp = new WordToSentenceProcessor<>();
List<List<CoreLabel>> sents = wtsp.process(ptb.tokenize());
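Putting these pieces together, the following minimal sketch prints each detected sentence
on its own line. The paragraph string and the PTBSentenceDetection class name are
placeholders introduced for this example; the tokenizer and processor calls are the ones
shown above, and CoreLabel's word method returns a token's text:

import java.io.StringReader;
import java.util.List;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.process.WordToSentenceProcessor;

public class PTBSentenceDetection {
    public static void main(String[] args) {
        // Placeholder text; any paragraph of English prose will do
        String paragraph = "This is the first sentence. This is the second sentence.";
        PTBTokenizer<CoreLabel> ptb = new PTBTokenizer<>(
            new StringReader(paragraph), new CoreLabelTokenFactory(), null);
        WordToSentenceProcessor<CoreLabel> wtsp = new WordToSentenceProcessor<>();
        List<List<CoreLabel>> sents = wtsp.process(ptb.tokenize());
        // Each inner list holds the tokens of one detected sentence
        for (List<CoreLabel> sentence : sents) {
            StringBuilder sb = new StringBuilder();
            for (CoreLabel token : sentence) {
                sb.append(token.word()).append(' ');
            }
            System.out.println(sb.toString().trim());
        }
    }
}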