Training a Sentence Detector model
We will use OpenNLP's SentenceDetectorME class to illustrate the training process.
This class has a static train method that uses sample sentences found in a file. The method returns a model that is usually serialized to a file for later use.
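To sketch what this looks like, the following hypothetical helper trains a model from a stream of SentenceSample objects and serializes it to a file. It is only an illustration based on the OpenNLP 1.5.x API, not the code developed in this section; the class name SentenceModelTrainer, the "en" language code, and the use of default training parameters are assumptions:

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceSample;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.TrainingParameters;

public class SentenceModelTrainer {
    // Hypothetical helper: trains a sentence model from the samples and
    // writes the serialized model to modelFile ("en" and the default
    // training parameters are assumptions)
    public static SentenceModel trainAndSave(
            ObjectStream<SentenceSample> sampleStream, String modelFile)
            throws IOException {
        SentenceModel model = SentenceDetectorME.train(
            "en", sampleStream, true, null,
            TrainingParameters.defaultParams());
        OutputStream modelOut = new BufferedOutputStream(
            new FileOutputStream(modelFile));
        try {
            // Save the trained model for later use
            model.serialize(modelOut);
        } finally {
            modelOut.close();
        }
        return model;
    }
}

The saved file can later be reloaded with the SentenceModel(InputStream) constructor and passed to a SentenceDetectorME instance.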
Models use specially annotated data to specify clearly where a sentence ends. Frequently, a large file is used to provide a good sample for training. Part of the file is used for training, and the rest is used to verify the model after it has been trained.
The training file used by OpenNLP consists of one sentence per line. Usually, at least 10 to 20 sample sentences are needed to avoid processing errors. To demonstrate the process, we will use a file called sentence.train. It consists of Chapter 5 of Twenty Thousand Leagues Under the Sea by Jules Verne. The text of the chapter can be found at http://www.gutenberg.org/files/164/164-h/164-h.htm#chap05. The file can be downloaded from www.packtpub.com.
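For illustration only, a training file in this one-sentence-per-line format might begin as follows; these placeholder lines merely stand in for the actual sentences of the chapter:

This placeholder stands in for the first sentence of the chapter.
This placeholder stands in for the second sentence of the chapter.
Each additional sentence occupies its own line of the file.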
A FileReader object is used to open the file. This object is used as the argument of the PlainTextByLineStream constructor. The resulting stream consists of a string for each line of the file. This stream is used as the argument of the SentenceSampleStream constructor, which converts the sentence strings to SentenceSample objects. These objects hold the beginning index of each sentence. This process is shown next, where the statements are enclosed in a try block to handle exceptions that may be thrown by these statements:
try {
    // Produce one string per line of the training file
    ObjectStream<String> lineStream =
        new PlainTextByLineStream(new FileReader("sentence.train"));
    // Convert each line into a SentenceSample object
    ObjectStream<SentenceSample> sampleStream =
        new SentenceSampleStream(lineStream);
    ...
} catch (FileNotFoundException ex) {
    // Handle exception
} catch (IOException ex) {
    // Handle exception
}