Java Reference
In-Depth Information
Using OpenNLP
The DocumentCategorizer interface specifies methods to support the classification
process. The interface is implemented by the DocumentCategorizerME class. This
class will classify text into predefined categories using a maximum entropy framework. We
will:
• Demonstrate how to train the model
• Illustrate how the model can be used
Training an OpenNLP classification model
First, we have to train our model because OpenNLP does not have prebuilt models. This
process consists of creating a file of training data and then using the DocumentCat-
egorizerME model to perform the actual training. The model created is typically saved
in a file for later use.
The training file format consists of a series of lines where each line represents a document.
The first word of the line is the category. The category is followed by text separated by
whitespace. Here is an example for the dog category:
dog The most interesting feature of a dog is its ...
To demonstrate the training process, we created the en-animals.train file, where we
created two categories: cats and dogs. For the training text, we used sections of Wikipedia.
For dogs ( http://en.wikipedia.org/wiki/Dog ) , we used the As Pets section. For cats ( ht-
tp://en.wikipedia.org/wiki/Cats_and_humans ) , we used the Pet section plus the first para-
graph of the Domesticated varieties section. We also removed the numeric references from
the sections.
The first part of each line is shown here:
dog The most widespread form of interspecies bonding occurs
...
dog There have been two major trends in the changing status
of ...
dog There are a vast range of commodity forms available to
...
dog An Australian Cattle Dog in reindeer antlers sits on
Santa's lap ...
Search WWH ::




Custom Search