Parameter    Usage
int          Specifies how many times a feature is processed
int          The number of iterations used to train the maxent model
In the example that follows, we start by defining a BufferedOutputStream object that will be used to store the new model. Several of the methods used in the example throw exceptions, which are handled in catch blocks:
BufferedOutputStream modelOutputStream = null;
try {
    …
} catch (UnsupportedEncodingException ex) {
    // Handle the exception
} catch (IOException ex) {
    // Handle the exception
}
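As a side note on the stream handling, the following is a minimal sketch of one alternative, assuming Java 7 or later: a try-with-resources block closes the BufferedOutputStream automatically, so the variable does not need to be declared as null outside the try block. The class name ModelStreamSketch, the save method, and the placeholder bytes are hypothetical and stand in for the real model data.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class ModelStreamSketch {

    // Writes the payload through a BufferedOutputStream. The stream is
    // flushed and closed automatically, even if write() throws.
    public static void save(Path target, byte[] payload) throws IOException {
        try (OutputStream out =
                new BufferedOutputStream(Files.newOutputStream(target))) {
            out.write(payload);
        }
    }

    public static void main(String[] args) {
        try {
            // Hypothetical target file; placeholder bytes instead of a model
            Path target = Files.createTempFile("model-sketch", ".bin");
            save(target, new byte[] {1, 2, 3});
            System.out.println("Bytes written: " + Files.size(target));
        } catch (IOException ex) {
            // Handle the exception
            ex.printStackTrace();
        }
    }
}
```

With this form, the catch blocks only need to deal with the failure itself, since closing the stream no longer requires a separate finally block.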
An ObjectStream instance is created using the PlainTextByLineStream class, which takes the training file and the character encoding scheme as its constructor arguments. This stream is then used to create a second ObjectStream instance that supplies TokenSample objects. These objects are text with token span information included:
ObjectStream<String> lineStream = new PlainTextByLineStream(
new FileInputStream("training-data.train"), "UTF-8");
ObjectStream<TokenSample> sampleStream =
new TokenSampleStream(lineStream);
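To make "text with token span information" more concrete, the sketch below parses a single training line in the format OpenNLP tokenizer training files use, where tokens are separated by whitespace or by a <SPLIT> marker. It is an illustration only and not OpenNLP's own parser: the class SplitLineSketch, its methods, and the sample sentence are hypothetical, and it assumes that tokens are separated by a single space once the markers are removed.

```java
import java.util.ArrayList;
import java.util.List;

class SplitLineSketch {
    static final String SPLIT = "<SPLIT>";

    // Returns the line as plain text: markers removed, whitespace normalized.
    public static String plainText(String line) {
        return line.replace(SPLIT, "").trim().replaceAll("\\s+", " ");
    }

    // Returns {start, end} character offsets (end exclusive) of each token,
    // measured against the text returned by plainText().
    public static List<int[]> tokenSpans(String line) {
        List<int[]> spans = new ArrayList<>();
        int offset = 0;
        for (String chunk : line.trim().split("\\s+")) {
            int before = offset;
            // A <SPLIT> marker is a token boundary without whitespace
            for (String token : chunk.split(SPLIT)) {
                if (!token.isEmpty()) {
                    spans.add(new int[] {offset, offset + token.length()});
                    offset += token.length();
                }
            }
            if (offset > before) {
                offset++; // account for the space that follows this chunk
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String line = "Mr. Smith arrived<SPLIT>.";
        String text = plainText(line);
        for (int[] span : tokenSpans(line)) {
            System.out.println(span[0] + ".." + span[1] + " "
                    + text.substring(span[0], span[1]));
        }
    }
}
```

In the sample line, the final period has no whitespace before it, so the marker is what tells the trainer that "arrived" and "." are separate tokens.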
The train method can now be used as shown in the following code. English is specified as the language, and alphanumeric information is ignored. The feature and iteration values are set to 5 and 100, respectively.
TokenizerModel model = TokenizerME.train(
"en", sampleStream, true, 5, 100);
The parameters of the train method are given in detail in the following table: