Finding Sentences - Natural Language Processing with Java


Using the SentenceChunker class

An alternative approach is to use the SentenceChunker class to perform SBD. The constructor of this class requires a TokenizerFactory object and a SentenceModel object, as shown here:

TokenizerFactory tokenizerfactory =
    IndoEuropeanTokenizerFactory.INSTANCE;
SentenceModel sentenceModel =
    new IndoEuropeanSentenceModel();

The SentenceChunker instance is created using the tokenizer factory and sentence model instances:

SentenceChunker sentenceChunker =
    new SentenceChunker(tokenizerfactory, sentenceModel);

The SentenceChunker class implements the Chunker interface, which declares a chunk method. This method returns an object that implements the Chunking interface. This object specifies "chunks" of text over a character sequence (CharSequence).

The chunk method takes a character array together with start and end indexes into that array, which specify the portion of the text to be processed. A Chunking object is returned like this:

Chunking chunking = sentenceChunker.chunk(
    paragraph.toCharArray(), 0, paragraph.length());

We will use the Chunking object for two purposes. First, we will use its chunkSet method to return a Set of Chunk objects. Then we will obtain a string holding all the sentences:

Set<Chunk> sentences = chunking.chunkSet();
String slice = chunking.charSequence().toString();

A Chunk object stores the character offsets of a sentence's boundaries. We will use its start and end methods in conjunction with the slice to display the sentences. Each element of the set, sentence, holds one sentence's boundaries, so the text between its start and end offsets in the slice is that sentence.
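The offset-based display described above can be tried without LingPipe on the classpath. The following is only a minimal sketch under stated assumptions: the JDK's java.text.BreakIterator stands in for the sentence detector, and Span is a hypothetical class that mimics just the start and end methods of Chunk. The class and method names are illustrative and are not part of the LingPipe API.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceOffsetsSketch {

    // Hypothetical stand-in for a Chunk: it carries only the start and
    // end character offsets of one sentence.
    static final class Span {
        private final int start;
        private final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        int start() { return start; }
        int end() { return end; }
    }

    // Assumption: sentence boundaries come from the JDK's BreakIterator,
    // not from LingPipe's SentenceChunker.
    static List<Span> sentenceSpans(String text) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
        iterator.setText(text);
        List<Span> spans = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next();
                end != BreakIterator.DONE;
                start = end, end = iterator.next()) {
            spans.add(new Span(start, end));
        }
        return spans;
    }

    // Uses each span's offsets to cut its sentence out of the full text,
    // trimming the whitespace that trails a boundary.
    static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        for (Span sentence : sentenceSpans(text)) {
            result.add(text.substring(sentence.start(), sentence.end()).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String paragraph = "The first sentence ends here. "
                + "The second one follows! Is this the third?";
        for (String sentence : sentences(paragraph)) {
            System.out.println("[" + sentence + "]");
        }
    }
}
```

The point of the sketch is the pattern rather than the detector: the boundaries are kept as plain offsets, and the sentence text is recovered only when needed by taking a substring of the full text. With LingPipe, the same loop would iterate over the Set of Chunk objects and take substrings of slice.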
