Training a model
We will use OpenNLP to demonstrate how a model is trained. The training file used must:
• Contain marks to demarcate the entities
• Have one sentence per line
We will use the following training file, named en-ner-person.train:
<START:person> Joe <END> was the last person to see <START:person> Fred <END>.
He saw him in Boston at McKenzie's pub at 3:00 where he paid $2.45 for an ale.
<START:person> Joe <END> wanted to go to Vermont for the day to visit a cousin who works at IBM, but <START:person> Sally <END> and he had to look for <START:person> Fred <END>.
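To make the markup concrete, here is a minimal sketch that pulls the tagged names out of one training line using only the standard library. The class TrainingLineInspector and its taggedNames method are hypothetical helpers written for illustration; they are not part of OpenNLP, which does its own parsing of this format during training.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not an OpenNLP class.
class TrainingLineInspector {
    // Matches "<START:type> tokens <END>" and captures the type and the tokens.
    private static final Pattern ENTITY =
            Pattern.compile("<START:(\\w+)>\\s+(.+?)\\s+<END>");

    // Returns the text of every span in the line that is tagged with the given type.
    static List<String> taggedNames(String line, String type) {
        List<String> names = new ArrayList<>();
        Matcher m = ENTITY.matcher(line);
        while (m.find()) {
            if (m.group(1).equals(type)) {
                names.add(m.group(2));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String line = "<START:person> Joe <END> was the last person to see "
                + "<START:person> Fred <END>.";
        System.out.println(taggedNames(line, "person"));
    }
}
```

Run against the first sentence of the training file, the sketch prints [Joe, Fred]. A sentence without tags, such as the second one, yields an empty list.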
Several methods in this example can throw exceptions. The statements that call them are placed in a try-with-resources block, as shown here, where the model's output stream is created:
try (OutputStream modelOutputStream = new BufferedOutputStream(
        new FileOutputStream(new File("modelFile")))) {
    ...
} catch (IOException ex) {
    // Handle exception
}
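The effect of the try-with-resources block can be seen in a small stand-alone sketch. It assumes a hypothetical class ModelStreamSketch and writes arbitrary bytes rather than a trained model; the point is only that the stream is flushed and closed when the block exits, so no explicit close() call is needed.

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical class for illustration; the payload stands in for a serialized model.
class ModelStreamSketch {
    // Writes the payload to the target file and returns the file's length afterwards.
    static long writeAndClose(File target, byte[] payload) {
        try (OutputStream out = new BufferedOutputStream(
                new FileOutputStream(target))) {
            out.write(payload);
        } catch (IOException ex) {
            // A real application would report or recover here.
            throw new UncheckedIOException(ex);
        }
        // The stream was flushed and closed when the try block exited,
        // so the buffered bytes are already on disk.
        return target.length();
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("modelFile", ".bin");
        tmp.deleteOnExit();
        System.out.println(writeAndClose(tmp, new byte[] {1, 2, 3}));
    }
}
```

Because the buffered stream is closed before target.length() is read, the reported length equals the number of bytes written.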
Within the block, we create an ObjectStream<String> object using the PlainTextByLineStream class. This class's constructor takes a FileInputStream instance, and the resulting stream returns each line as a String object. The en-ner-person.train file is used as the input file, as shown here. The UTF-8 string specifies the character encoding used:
ObjectStream<String> lineStream = new PlainTextByLineStream(
new FileInputStream("en-ner-person.train"), "UTF-8");
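For readers without OpenNLP on the classpath, the following sketch approximates what the line stream does with plain java.io classes: it decodes the bytes as UTF-8 and hands back one String per line. LineReadingSketch and readLines are hypothetical names, and this is a stand-in for illustration rather than how PlainTextByLineStream is implemented.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration; not the OpenNLP implementation.
class LineReadingSketch {
    // Decodes the input as UTF-8 and returns its lines in order.
    static List<String> readLines(InputStream in) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            // readLine() returns null once the input is exhausted.
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "<START:person> Joe <END> left.\nHe paid $2.45 for an ale.\n"
                .getBytes(StandardCharsets.UTF_8);
        for (String line : readLines(new ByteArrayInputStream(data))) {
            System.out.println(line);
        }
    }
}
```

Each element of the returned list corresponds to one sentence of the training file, which is why the file must hold exactly one sentence per line.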
The lineStream object returns lines that are annotated with tags delineating the entities in the text. These need to be converted to NameSample objects so that the model