Training a tokenizer to find parts of text
Training a tokenizer is useful when we encounter text that is not handled well by standard
tokenizers. Instead of writing a custom tokenizer, we can create a tokenizer model that can
be used to perform the tokenization.
To demonstrate how such a model can be created, we will read training data from a file and
then train a model using this data. The data is stored as a series of words separated by
whitespace and <SPLIT> fields. A <SPLIT> field provides further information about how tokens should be identified: it marks a token boundary that whitespace alone does not capture, such as the break between a number like 23.6 and a trailing comma, or between a word and its closing punctuation. The training data we will use is stored in the file training-data.train and is shown here:
These fields are used to provide further information about
how tokens should be identified<SPLIT>.
They can help identify breaks between numbers<SPLIT>, such
as 23.6<SPLIT>, punctuation characters such as commas<SPLIT>.
The data we use is not meant to be realistic training text, but it does illustrate how to annotate text and the process used to train a model.
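To see how one annotated line is interpreted, the sketch below parses it with OpenNLP's TokenSample.parse method, which is the same parsing that TokenSampleStream applies to each line of the training file. The snippet assumes the OpenNLP 1.5.x API:

```java
import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.util.Span;

public class ParseSampleDemo {
    public static void main(String[] args) {
        // Parse one annotated line: <SPLIT> marks a token boundary that
        // whitespace alone would miss (here, before the comma and period).
        TokenSample sample = TokenSample.parse(
            "as 23.6<SPLIT>, such as commas<SPLIT>.", "<SPLIT>");

        // The sample's text has the <SPLIT> markers removed, and each
        // token, including "," and ".", is reported as a separate span.
        for (Span span : sample.getTokenSpans()) {
            System.out.println(span.getCoveredText(sample.getText()));
        }
    }
}
```

Printing the covered text of each span shows that 23.6 and the comma following it become separate tokens, which is exactly the boundary information the trained model will learn.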
We will use the overloaded train method of the OpenNLP TokenizerME class to create a model. The last two parameters, the cutoff and the number of iterations, require additional explanation. The underlying maxent (maximum entropy) model is used to determine the relationship between elements of the text. With the cutoff, we specify the number of times a feature must occur before it is included in the model; these features can be thought of as aspects of the model. Iterations refers to the number of times the training procedure will iterate when determining the model's parameters. A few of the TokenizerME train parameters are as follows:
Parameter                   Usage
String                      A code for the language used
ObjectStream<TokenSample>   An ObjectStream containing the training data
boolean                     If true, then alphanumeric data is ignored
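Putting the pieces together, a minimal training sketch might look like the following. It assumes the OpenNLP 1.5.x signature, where train takes the language code, sample stream, alphanumeric flag, cutoff, and iteration count directly (later releases bundle the last two into a TrainingParameters object); the cutoff of 5 and 100 iterations shown here are the library's conventional defaults:

```java
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class TrainTokenizerDemo {
    public static void main(String[] args) throws Exception {
        // Stream the annotated training file line by line and wrap it
        // so each line is parsed into a TokenSample.
        ObjectStream<String> lineStream = new PlainTextByLineStream(
            new InputStreamReader(
                new FileInputStream("training-data.train"), "UTF-8"));
        ObjectStream<TokenSample> sampleStream =
            new TokenSampleStream(lineStream);

        // Train: language code, samples, alphanumeric optimization,
        // feature cutoff (5), and training iterations (100).
        TokenizerModel model =
            TokenizerME.train("en", sampleStream, true, 5, 100);
        sampleStream.close();

        // Use the trained model to tokenize new text.
        Tokenizer tokenizer = new TokenizerME(model);
        for (String token : tokenizer.tokenize("The cost was 23.6, as expected.")) {
            System.out.println(token);
        }

        // Optionally persist the model for later reuse.
        try (OutputStream out = new BufferedOutputStream(
                new FileOutputStream("mymodel.bin"))) {
            model.serialize(out);
        }
    }
}
```

Serializing the model at the end lets us reload it later with new TokenizerModel(new FileInputStream("mymodel.bin")) instead of retraining each time.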