Using the SentenceChunker class
An alternative approach is to use the SentenceChunker class to perform SBD. The constructor of this class requires a TokenizerFactory object and a SentenceModel object, as shown here:
TokenizerFactory tokenizerfactory = IndoEuropeanTokenizerFactory.INSTANCE;
SentenceModel sentenceModel = new IndoEuropeanSentenceModel();
The SentenceChunker instance is created using the tokenizer factory and sentence model instances:
SentenceChunker sentenceChunker =
new SentenceChunker(tokenizerfactory, sentenceModel);
The SentenceChunker class implements the Chunker interface, which declares a chunk method. This method returns an object that implements the Chunking interface. This object identifies "chunks" of text within an underlying character sequence (CharSequence).
The chunk method takes a character array together with start and end indexes into that array, which specify the portion of the text to process. A Chunking object is returned like this:
Chunking chunking = sentenceChunker.chunk(
    paragraph.toCharArray(), 0, paragraph.length());
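The array-plus-indexes convention can be tried out with plain JDK calls. The following is a minimal sketch, not LingPipe code: the RangeDemo class and its region method are hypothetical names, and it assumes that the start index is inclusive and the end index exclusive, as passing paragraph.length() in the call above suggests.

```java
public class RangeDemo {
    // Hypothetical helper: returns the text in cs[start, end).
    // Assumption: start is inclusive and end is exclusive.
    public static String region(char[] cs, int start, int end) {
        return new String(cs, start, end - start);
    }

    public static void main(String[] args) {
        String paragraph = "One. Two. Three.";
        char[] cs = paragraph.toCharArray();
        // Whole paragraph, mirroring the (array, 0, length) call pattern.
        System.out.println(region(cs, 0, paragraph.length()));
        // A narrower range restricts processing to part of the array.
        System.out.println(region(cs, 5, 9));
    }
}
```

Passing 0 and the full length selects the entire paragraph; narrower indexes would limit processing to a region of the same array without copying it first.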
We will use the Chunking object for two purposes. First, we will use its chunkSet method to return a Set of Chunk objects. Then we will obtain a string holding all the sentences:
Set<Chunk> sentences = chunking.chunkSet();
String slice = chunking.charSequence().toString();
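The idea of keeping sentence boundaries as offsets into a single string can be explored without LingPipe on the classpath. The sketch below swaps in the JDK's java.text.BreakIterator, which is a different sentence detector from SentenceChunker and may split text differently. The OffsetSentences class and its sentenceOffsets method are hypothetical names used only for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class OffsetSentences {
    // Returns {start, end} offset pairs, one per sentence, as found by the
    // JDK's BreakIterator (a stand-in for Chunk objects, not LingPipe's model).
    public static List<int[]> sentenceOffsets(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<int[]> spans = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            spans.add(new int[] {start, end});
        }
        return spans;
    }

    public static void main(String[] args) {
        String slice = "The first sentence. The second sentence.";
        for (int[] span : sentenceOffsets(slice)) {
            // Offsets only have meaning relative to the string they index.
            // BreakIterator spans include trailing whitespace, hence trim().
            System.out.println("[" + slice.substring(span[0], span[1]).trim() + "]");
        }
    }
}
```

Storing offsets rather than copies of each sentence keeps the original text intact and lets the caller decide how to extract or display each span.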
A Chunk object stores the character offsets of a sentence's boundaries. We will use its start and end methods in conjunction with the slice to display the sentences, as shown next. Each element, sentence, holds one sentence's boundaries. We use this information to display each sentence in the slice: