Finding Parts of Text - Natural Language Processing with Java - page 71

Parameter    Usage
int          The cutoff: the minimum number of times a feature must be seen in the training data to be used
int          The number of iterations used to train the maxent model

In the example that follows, we start by defining a BufferedOutputStream object that will be used to store the new model. Several of the methods used in the example can throw exceptions, which are handled in catch blocks:

BufferedOutputStream modelOutputStream = null;
try {
    …
} catch (UnsupportedEncodingException ex) {
    // Handle the exception
} catch (IOException ex) {
    // Handle the exception
}
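
On Java 7 and later, the same stream handling can also be written with try-with-resources, so that the stream is closed even when an exception is thrown. The following is a minimal sketch and not the book's code: the class ModelStreamSketch, the helper writeBytes, and the placeholder bytes are assumptions made for illustration. In the real example the bytes would come from the trained model.

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; it is not part of OpenNLP or of the book's example.
final class ModelStreamSketch {

    // Writes the payload through a BufferedOutputStream.
    // try-with-resources flushes and closes the stream on every exit path.
    static void writeBytes(File target, byte[] payload) throws IOException {
        try (BufferedOutputStream out =
                new BufferedOutputStream(new FileOutputStream(target))) {
            out.write(payload);
        }
    }

    public static void main(String[] args) {
        try {
            File target = File.createTempFile("model-sketch", ".bin");
            target.deleteOnExit();
            // Placeholder bytes standing in for a serialized model (assumption).
            writeBytes(target, "placeholder".getBytes(StandardCharsets.UTF_8));
            System.out.println("Wrote " + target.length() + " bytes");
        } catch (IOException ex) {
            // Handle the exception
            System.err.println("Could not write the file: " + ex.getMessage());
        }
    }
}
```

Because UnsupportedEncodingException is a subclass of IOException, a single catch (IOException ex) block covers both cases when they need the same handling.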

An ObjectStream instance is created using the PlainTextByLineStream class, whose constructor takes the training file and the character encoding scheme as arguments. This stream is then used to create a second ObjectStream instance, which supplies TokenSample objects. Each TokenSample object holds text together with its token span information:

ObjectStream<String> lineStream = new PlainTextByLineStream(
    new FileInputStream("training-data.train"), "UTF-8");
ObjectStream<TokenSample> sampleStream =
    new TokenSampleStream(lineStream);
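
To make "text with token span information" more concrete, the sketch below turns one training line into plain text plus character offsets for each token. It assumes the training format described in the OpenNLP documentation, where whitespace separates tokens and a <SPLIT> marker separates tokens that are not divided by whitespace. It is written in plain Java and is not OpenNLP's implementation: the class TokenSpanSketch, its Sample type, and the sample sentence are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration; in the real API, TokenSample does this work.
final class TokenSpanSketch {

    // Marker assumed from the OpenNLP tokenizer training format.
    static final String SPLIT = "<SPLIT>";

    // Plain text plus [start, end) character offsets for each token.
    static final class Sample {
        final String text;
        final List<int[]> spans;

        Sample(String text, List<int[]> spans) {
            this.text = text;
            this.spans = spans;
        }
    }

    static Sample parse(String line) {
        StringBuilder text = new StringBuilder();
        List<int[]> spans = new ArrayList<>();
        // Whitespace separates tokens and is kept as a single space in the text.
        for (String chunk : line.trim().split("\\s+")) {
            if (text.length() > 0) {
                text.append(' ');
            }
            // The marker separates tokens that touch each other in the text.
            for (String token : chunk.split(Pattern.quote(SPLIT))) {
                if (token.isEmpty()) {
                    continue;
                }
                int start = text.length();
                text.append(token);
                spans.add(new int[] {start, text.length()});
            }
        }
        return new Sample(text.toString(), spans);
    }

    public static void main(String[] args) {
        // Made-up training line, not taken from the book's training file.
        Sample sample = parse("He lives at 1511 W<SPLIT>. Randolph<SPLIT>.");
        System.out.println(sample.text);
        for (int[] span : sample.spans) {
            System.out.println(span[0] + ".." + span[1] + "  "
                    + sample.text.substring(span[0], span[1]));
        }
    }
}
```

The marker itself never appears in the recovered text; it only tells the trainer that a token boundary exists where no whitespace does.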

The train method can now be used as shown in the following code. English is specified as the language, and the true argument causes alphanumeric information to be ignored. The feature cutoff and the number of iterations are set to 5 and 100, respectively:

TokenizerModel model = TokenizerME.train(
    "en", sampleStream, true, 5, 100);
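
The effect of the feature value can be pictured with a small counting example. The sketch below is not OpenNLP's trainer. It assumes only that the value acts as a cutoff, so that a feature is kept when it is seen at least that many times in the training events. The class CutoffSketch, the method keepFrequent, and the feature names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of a feature cutoff; not OpenNLP code.
final class CutoffSketch {

    // Returns the features that occur at least `cutoff` times in `observed`.
    static Set<String> keepFrequent(List<String> observed, int cutoff) {
        Map<String, Integer> counts = new HashMap<>();
        for (String feature : observed) {
            counts.merge(feature, 1, Integer::sum);
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= cutoff) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up feature observations; real features come from the trainer.
        List<String> observed = Arrays.asList(
                "prev=.", "next=upper", "prev=.", "next=digit", "prev=.");
        // With a cutoff of 2, only the feature seen three times survives.
        System.out.println(keepFrequent(observed, 2));
    }
}
```

Under this reading, a higher cutoff discards rare features and gives a smaller model, while a small training file combined with a cutoff of 5 may leave very few features to learn from.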

The parameters of the train method are given in detail in the following table:
