Training a tokenizer to find parts of text
Training a tokenizer is useful when we encounter text that is not handled well by standard
tokenizers. Instead of writing a custom tokenizer, we can create a tokenizer model that can
be used to perform the tokenization.
To demonstrate how such a model can be created, we will read training data from a file and then train a model using this data. The data is stored as a series of words separated by whitespace and <SPLIT> fields. The <SPLIT> fields are used to provide further information about how tokens should be identified. They can help identify breaks between numbers, such as 23.6, and punctuation characters such as commas. The training data we will use is stored in the file training-data.train, and is shown here:
These fields are used to provide further information about
how tokens should be identified<SPLIT>.
They can help identify breaks between numbers<SPLIT>, such
as 23.6<SPLIT>, punctuation characters such as commas<SPLIT>.
The data that we use does not represent unique text, but it does illustrate how to annotate
text and the process used to train a model.
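To make the annotation format concrete, the following sketch (plain Java, no OpenNLP required; the class and method names are illustrative, and the sample line is taken from the training data above) shows how a <SPLIT> marker denotes a token boundary that whitespace splitting alone would miss:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitDemo {
    // Expand a line of training data into the tokens it annotates:
    // whitespace separates tokens, and each <SPLIT> marker denotes an
    // additional boundary inside what would otherwise be one token.
    static List<String> tokensOf(String annotatedLine) {
        List<String> tokens = new ArrayList<>();
        for (String field : annotatedLine.trim().split("\\s+")) {
            tokens.addAll(Arrays.asList(field.split("<SPLIT>")));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // A line from training-data.train shown above
        String line = "They can help identify breaks between numbers<SPLIT>, such";
        // "numbers<SPLIT>," annotates the two tokens "numbers" and ","
        System.out.println(tokensOf(line));
    }
}
```

Without the marker, "numbers," would be treated as a single token; the annotation tells the trainer that the comma is a separate token.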
We will use the OpenNLP TokenizerME class's overloaded train method to create a model. The last two parameters require additional explanation. A maximum entropy (maxent) model is used to determine the relationships between elements of the text. The cutoff specifies the number of times a feature must occur before it is included in the model; these features can be thought of as aspects of the model. Iterations refers to the number of times the training procedure will iterate when determining the model's parameters. Some of the train method's parameters are as follows:
Parameter                    Usage
String                       A code for the language used
ObjectStream<TokenSample>    An ObjectStream parameter containing the training data
boolean                      If true, then alphanumeric data is ignored
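Putting the pieces together, a training call might look like the following sketch. It assumes the OpenNLP 1.5-style overload TokenizerME.train(language, samples, useAlphaNumericOptimization, cutoff, iterations); later releases move the cutoff and iteration counts into a TrainingParameters object, so check the version you are using. The output file name mymodel.bin is our own choice:

```java
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class TrainTokenizer {
    public static void main(String[] args) throws Exception {
        // Read the annotated training data line by line
        ObjectStream<String> lines = new PlainTextByLineStream(
                new FileInputStream("training-data.train"), "UTF-8");
        // Interpret the <SPLIT> markers in each line as token boundaries
        ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

        // Language code, training data, alphanumeric optimization,
        // feature cutoff, and number of training iterations
        // (5 and 100 are common defaults, not values mandated by the text)
        TokenizerModel model = TokenizerME.train("en", samples, true, 5, 100);
        samples.close();

        // Persist the model so it can be reloaded later
        try (OutputStream out = new BufferedOutputStream(
                new FileOutputStream("mymodel.bin"))) {
            model.serialize(out);
        }
    }
}
```

Once serialized, the model can be reloaded with new TokenizerModel(new FileInputStream("mymodel.bin")) and passed to a TokenizerME instance to tokenize new text.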