Training a model
We will use OpenNLP to demonstrate how a model is trained. The training file used must:
• Contain marks to demarcate the entities
• Have one sentence per line
We will use the following training file, named en-ner-person.train:
<START:person> Joe <END> was the last person to see <START:person> Fred <END>.
He saw him in Boston at McKenzie's pub at 3:00 where he paid $2.45 for an ale.
<START:person> Joe <END> wanted to go to Vermont for the day to visit a cousin who works at IBM, but <START:person> Sally <END> and he had to look for <START:person> Fred <END>.
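To make the markup concrete, here is a minimal sketch that pulls the tagged names out of one training line using only java.util.regex. It is not how OpenNLP reads the file (OpenNLP has its own parser for this format); the class name TaggedNameExtractor and the method extractNames are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP: illustrates what the tags encode.
final class TaggedNameExtractor {

    // Matches "<START:type> tokens <END>", capturing the type and the tokens.
    private static final Pattern TAG =
            Pattern.compile("<START:(\\w+)>\\s+(.+?)\\s+<END>");

    // Returns the text of every entity of the given type found in one line.
    static List<String> extractNames(String line, String type) {
        List<String> names = new ArrayList<>();
        Matcher matcher = TAG.matcher(line);
        while (matcher.find()) {
            if (matcher.group(1).equals(type)) {
                names.add(matcher.group(2));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String line = "<START:person> Joe <END> wanted to go to Vermont for the day "
                + "to visit a cousin who works at IBM, but <START:person> Sally <END> "
                + "and he had to look for <START:person> Fred <END>.";
        System.out.println(extractNames(line, "person")); // prints [Joe, Sally, Fred]
    }
}
```

Note that Vermont and IBM are not returned: only spans wrapped in the tags count as entities, which is why untagged sentences such as the second line of the file still serve as useful training data.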
Several methods used in this example can throw exceptions, so the statements are placed in a
try-with-resources block, as shown here, where the model's output stream is
created:
try (OutputStream modelOutputStream = new BufferedOutputStream(
        new FileOutputStream(new File("modelFile")))) {
    ...
} catch (IOException ex) {
    // Handle exception
}
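As a standalone illustration of the try-with-resources pattern, the following sketch writes a few bytes through a buffered stream and relies on the block to flush and close it. The class name TryWithResourcesDemo, the method writeBytes, and the use of a temporary file are assumptions made for this example; they do not come from the training code.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class: shows that the resource is closed automatically.
final class TryWithResourcesDemo {

    // Writes the data to the target file and returns the resulting file size.
    static long writeBytes(Path target, byte[] data) throws IOException {
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(target))) {
            out.write(data);
        } // out is flushed and closed here, even if write() had thrown
        return Files.size(target);
    }

    public static void main(String[] args) {
        try {
            Path target = Files.createTempFile("model", ".bin");
            System.out.println(writeBytes(target, new byte[] {1, 2, 3}));
            Files.delete(target);
        } catch (IOException ex) {
            System.err.println("Could not write the file: " + ex.getMessage());
        }
    }
}
```

Because the buffered stream is closed when the block exits, the bytes are on disk before the file size is read, with no explicit flush() or finally clause.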
Within the block, we create an ObjectStream<String> object using the PlainTextByLineStream
class. The constructor takes a FileInputStream instance, and the resulting stream
returns each line as a String object. The en-ner-person.train file is used as
the input file, as shown here. The "UTF-8" string names the character encoding used to read the file:
ObjectStream<String> lineStream = new PlainTextByLineStream(
new FileInputStream("en-ner-person.train"), "UTF-8");
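For readers without OpenNLP on the classpath, the same idea of reading a file one line at a time with an explicit encoding can be sketched with the standard library alone. This is an analogue, not the PlainTextByLineStream implementation; LineByLineReader and readLines are hypothetical names, and the temporary file stands in for en-ner-person.train.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in that mimics line-by-line reading with a fixed encoding.
final class LineByLineReader {

    // Reads every line of the file as UTF-8 and returns them in order.
    static List<String> readLines(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            // readLine() returns null once the end of the file is reached.
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a temporary file is used here in place of the real training file.
        Path file = Files.createTempFile("en-ner-person", ".train");
        Files.write(file, Arrays.asList(
                "<START:person> Joe <END> was the last person to see <START:person> Fred <END>.",
                "He saw him in Boston at McKenzie's pub at 3:00 where he paid $2.45 for an ale."),
                StandardCharsets.UTF_8);
        for (String line : readLines(file)) {
            System.out.println(line);
        }
        Files.delete(file);
    }
}
```

Naming the encoding explicitly, rather than relying on the platform default, keeps the result the same across machines when the training text contains non-ASCII characters.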
The lineStream object contains lines that are annotated with tags delineating the
entities in the text. These need to be converted to NameSample objects so that the model