Using the Stanford API
The Stanford NLP library supports several techniques for performing sentence detection. In this section, we will demonstrate the process using the following classes:

• PTBTokenizer
• DocumentPreprocessor
• StanfordCoreNLP

Although all of them perform SBD, each uses a different approach to the process.
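Before turning to these classes, it may help to see what rule-based sentence detection looks like with nothing but the JDK. The following standalone sketch (not part of the Stanford API; the class name and sample text are ours) uses java.text.BreakIterator, whose locale-based rules the Stanford classes effectively improve upon:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class JdkSbdExample {
    // Split text into sentences using the JDK's locale-aware break rules.
    static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            result.add(text.substring(start, end).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String s : sentences("This is a test. It has two sentences.")) {
            System.out.println(s);
        }
    }
}
```

This baseline handles simple prose, but it has no model of abbreviations or domain-specific conventions, which is where the Stanford classes described next earn their keep.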
Using the PTBTokenizer class
The PTBTokenizer class uses rules to perform SBD and has a variety of tokenization options. The constructor for this class possesses three parameters:

• A Reader class that encapsulates the text to be processed
• An object that implements the LexedTokenFactory interface
• A string holding the tokenization options

These parameters allow us to specify the text, the token factory to be used, and any options we may need for a specific text stream.
In the following code sequence, an instance of the StringReader class is created to encapsulate the text. The CoreLabelTokenFactory class is used, with the options left as null for this example:

PTBTokenizer<CoreLabel> ptb = new PTBTokenizer<>(
    new StringReader(paragraph), new CoreLabelTokenFactory(), null);
We will use the WordToSentenceProcessor class to create a List<List<CoreLabel>> that holds the sentences and their tokens. Its process method takes the tokens produced by the PTBTokenizer instance and groups them into sentences, as shown here:

WordToSentenceProcessor<CoreLabel> wtsp = new WordToSentenceProcessor<>();
List<List<CoreLabel>> sents = wtsp.process(ptb.tokenize());
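Putting the two steps together, a minimal self-contained sketch might look like the following (it assumes the Stanford CoreNLP jar is on the classpath; the class name and sample paragraph are ours):

```java
import java.io.StringReader;
import java.util.List;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.process.WordToSentenceProcessor;

public class PtbSbdExample {
    // Tokenize the paragraph, then group the tokens into sentences.
    static List<List<CoreLabel>> detectSentences(String paragraph) {
        PTBTokenizer<CoreLabel> ptb = new PTBTokenizer<>(
                new StringReader(paragraph), new CoreLabelTokenFactory(), null);
        WordToSentenceProcessor<CoreLabel> wtsp = new WordToSentenceProcessor<>();
        return wtsp.process(ptb.tokenize());
    }

    public static void main(String[] args) {
        String paragraph = "The voyage began well. Then the storm hit. We survived.";
        for (List<CoreLabel> sentence : detectSentences(paragraph)) {
            StringBuilder sb = new StringBuilder();
            for (CoreLabel token : sentence) {
                // CoreLabel.word() returns the token's surface text.
                sb.append(token.word()).append(' ');
            }
            System.out.println(sb.toString().trim());
        }
    }
}
```

Note that each inner List<CoreLabel> holds individual tokens rather than the original string, so reconstructing readable sentences requires rejoining the tokens, as the loop above does.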