Training a Sentence Detector model
We will use OpenNLP's SentenceDetectorME class to illustrate the training process. This
class has a static train method that uses sample sentences found in a file. The method
returns a model that is usually serialized to a file for later use.
Models use specially annotated data to clearly specify where a sentence ends. Frequently, a
large file is used to provide a good sample. Part of the file is used for training, and the
rest is used to verify the model after it has been trained.
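The split between training and verification data can be sketched as follows. This is a minimal illustration using only the standard library, not part of OpenNLP: the class name SentenceFileSplit, the Parts holder, and the 80/20 ratio are assumptions chosen for the example, and the split is sequential (the first portion trains, the remainder verifies).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits one-sentence-per-line data into two parts. */
final class SentenceFileSplit {

    /** Holds the two partitions produced by split(). */
    static final class Parts {
        final List<String> training;
        final List<String> evaluation;

        Parts(List<String> training, List<String> evaluation) {
            this.training = training;
            this.evaluation = evaluation;
        }
    }

    /**
     * Puts the first trainFraction of the sentences in the training part
     * and the remainder in the evaluation part.
     */
    static Parts split(List<String> sentences, double trainFraction) {
        if (trainFraction <= 0.0 || trainFraction >= 1.0) {
            throw new IllegalArgumentException("trainFraction must be between 0 and 1");
        }
        int cut = (int) Math.round(sentences.size() * trainFraction);
        List<String> training = new ArrayList<>(sentences.subList(0, cut));
        List<String> evaluation = new ArrayList<>(sentences.subList(cut, sentences.size()));
        return new Parts(training, evaluation);
    }

    public static void main(String[] args) {
        // Placeholder sentences; a real run would read them from the training file.
        List<String> sentences = new ArrayList<>();
        for (int i = 1; i <= 20; i++) {
            sentences.add("Sample sentence number " + i + ".");
        }
        Parts parts = split(sentences, 0.8); // assumed 80/20 ratio
        System.out.println("training: " + parts.training.size());     // prints "training: 16"
        System.out.println("evaluation: " + parts.evaluation.size()); // prints "evaluation: 4"
    }
}
```

Each part would then be written to its own file, with only the training part passed to the trainer.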
The training file used by OpenNLP consists of one sentence per line. Usually, at least 10 to
20 sample sentences are needed to avoid processing errors. To demonstrate the process, we
will use a file called sentence.train, which contains text from Chapter 5 of Twenty Thousand
Leagues under the Sea by Jules Verne. The text of the book can be found at
http://www.gutenberg.org/files/164/164-h/164-h.htm#chap05. The file can be downloaded
from www.packtpub.com.
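Before training, it can help to confirm that a file follows the one-sentence-per-line layout and holds enough samples. The following is a rough sketch using only the standard library. The class TrainingFileCheck, its method names, and the choice of 10 as the threshold (the lower bound mentioned above) are assumptions for illustration. Blank lines are simply not counted.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-training check for a one-sentence-per-line file. */
final class TrainingFileCheck {

    // Assumed threshold: the lower bound of the "10 to 20" guideline.
    static final int MIN_SENTENCES = 10;

    /** Reads all non-blank lines, treating each one as a single sentence. */
    static List<String> readSentences(Reader source) {
        List<String> sentences = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    sentences.add(trimmed);
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return sentences;
    }

    /** Returns true when the sample meets the assumed minimum size. */
    static boolean hasEnoughSamples(List<String> sentences) {
        return sentences.size() >= MIN_SENTENCES;
    }

    public static void main(String[] args) {
        // In-memory stand-in for a training file; placeholder sentences only.
        String text = "This is the first sentence.\n"
                + "\n"
                + "This is the second sentence.\n"
                + "This is the third sentence.\n";
        List<String> sentences = readSentences(new StringReader(text));
        System.out.println("sentences: " + sentences.size());
        System.out.println("enough samples: " + hasEnoughSamples(sentences));
    }
}
```

To check a real file, a FileReader opened on sentence.train could be passed to readSentences in place of the StringReader.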
A FileReader object is used to open the file. This object is used as the argument of the
PlainTextByLineStream constructor. The stream that results consists of a string for each
line of the file. This is used as the argument of the SentenceSampleStream constructor,
which converts the sentence strings to SentenceSample objects. These objects hold the
beginning index of each sentence. This process is shown next, where the statements are
enclosed in a try block to handle exceptions that may be thrown by these statements:
try {
    // Read the training file as a stream of lines, one sentence per line
    ObjectStream<String> lineStream = new PlainTextByLineStream(
        new FileReader("sentence.train"));
    // Convert the sentence strings to SentenceSample objects
    ObjectStream<SentenceSample> sampleStream
        = new SentenceSampleStream(lineStream);
    ...
} catch (FileNotFoundException ex) {
    // Handle exception
} catch (IOException ex) {
    // Handle exception
}