Finding Sentences - Natural Language Processing with Java

Training a Sentence Detector model

We will use OpenNLP's SentenceDetectorME class to illustrate the training process. This class has a static train method that uses sample sentences found in a file. The method returns a model that is usually serialized to a file for later use.

Models are trained on specially annotated data that marks where each sentence ends. A large file is frequently used so that the training sample is representative. Part of the file is used for training, and the rest is held back to verify the model after it has been trained.
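The split between training data and verification data can be sketched in plain Java. This is a minimal illustration and not part of OpenNLP: the TrainingSplit class, its split method, and the 80/20 ratio used in main are all hypothetical choices made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not an OpenNLP class.
class TrainingSplit {

    // Returns two lists: index 0 holds the training sentences,
    // index 1 holds the sentences held back for verification.
    static List<List<String>> split(List<String> sentences, double trainFraction) {
        if (trainFraction < 0.0 || trainFraction > 1.0) {
            throw new IllegalArgumentException("trainFraction must be between 0 and 1");
        }
        // Number of sentences that go into the training part
        int cut = (int) Math.floor(sentences.size() * trainFraction);
        List<List<String>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(sentences.subList(0, cut)));
        parts.add(new ArrayList<>(sentences.subList(cut, sentences.size())));
        return parts;
    }

    public static void main(String[] args) {
        List<String> sentences = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            sentences.add("Sample sentence number " + i + ".");
        }
        // The 0.8 ratio is an assumption chosen for illustration
        List<List<String>> parts = split(sentences, 0.8);
        System.out.println("training: " + parts.get(0).size());
        System.out.println("held out: " + parts.get(1).size());
    }
}
```

The sentences keep their original order here, so the held-out part is simply the tail of the file. Shuffling before the split is a possible alternative when the order of the source text should not matter.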

The training file used by OpenNLP consists of one sentence per line. Usually, at least 10 to 20 sample sentences are needed to avoid processing errors. To demonstrate the process, we will use a file called sentence.train. It consists of Chapter 5 of Twenty Thousand Leagues under the Sea by Jules Verne. The text of the book can be found at http://www.gutenberg.org/files/164/164-h/164-h.htm#chap05. The file can be downloaded from www.packtpub.com.
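Before training, it may be worth checking that the file really holds enough sentences. The following is a rough sketch using only the Java standard library. The TrainingFileCheck class and the threshold of 10 are assumptions based on the guidance above, not an OpenNLP requirement expressed in code.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-check, not an OpenNLP class.
class TrainingFileCheck {

    // Assumed lower bound, taken from the "at least 10 to 20" guidance
    static final int MIN_SENTENCES = 10;

    // Keeps only lines that contain text; each is treated as one sentence
    static List<String> nonBlankLines(List<String> lines) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    static boolean hasEnoughSentences(List<String> lines) {
        return nonBlankLines(lines).size() >= MIN_SENTENCES;
    }

    public static void main(String[] args) throws IOException {
        Path path = Paths.get(args.length > 0 ? args[0] : "sentence.train");
        if (!Files.exists(path)) {
            System.out.println("File not found: " + path);
            return;
        }
        List<String> lines = Files.readAllLines(path, StandardCharsets.UTF_8);
        System.out.println("Sentences found: " + nonBlankLines(lines).size());
        System.out.println("Enough for training: " + hasEnoughSentences(lines));
    }
}
```

The check only counts non-blank lines. It does not verify that each line actually contains exactly one sentence, which still has to be ensured when the file is prepared.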

A FileReader object is used to open the file. This object is used as the argument of the PlainTextByLineStream constructor. The resulting stream yields one string for each line of the file. This stream is used as the argument of the SentenceSampleStream constructor, which converts the sentence strings to SentenceSample objects. These objects hold the beginning index of each sentence. This process is shown next, where the statements are enclosed in a try block to handle the exceptions that they may throw:

try {
    // Read the training file one line (one sentence) at a time
    ObjectStream<String> lineStream = new PlainTextByLineStream(
        new FileReader("sentence.train"));
    // Convert each line into a SentenceSample for training
    ObjectStream<SentenceSample> sampleStream
        = new SentenceSampleStream(lineStream);
    ...
} catch (FileNotFoundException ex) {
    // Handle exception
} catch (IOException ex) {
    // Handle exception
}
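To make the idea of a "beginning index of each sentence" concrete, the following sketch joins the lines of a training file into one document and records where each sentence starts. It uses only the Java standard library and is not how OpenNLP implements SentenceSample. The SentenceOffsets class, the single-space separator, and the sample sentences are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not an OpenNLP class.
class SentenceOffsets {

    // Computes the start index of each sentence, assuming the sentences
    // are joined into one document with a single space between them.
    static int[] startIndexes(List<String> sentences) {
        int[] starts = new int[sentences.size()];
        int position = 0;
        for (int i = 0; i < sentences.size(); i++) {
            starts[i] = position;
            // Advance past this sentence and the joining space
            position += sentences.get(i).length() + 1;
        }
        return starts;
    }

    public static void main(String[] args) {
        // Made-up sample lines, one sentence per line
        List<String> lines = Arrays.asList(
            "The sea is calm.",
            "We set sail at dawn.");
        String document = String.join(" ", lines);
        int[] starts = startIndexes(lines);
        for (int i = 0; i < starts.length; i++) {
            int end = starts[i] + lines.get(i).length();
            System.out.println(starts[i] + ": " + document.substring(starts[i], end));
        }
    }
}
```

Each recorded index points at the first character of a sentence in the joined text, which is the kind of boundary information a sentence detector needs to learn from.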
