Java Reference
In-Depth Information
Using OpenNLP
The
DocumentCategorizer
interface specifies methods to support the classification
process. The interface is implemented by the
DocumentCategorizerME
class. This
class will classify text into predefined categories using a maximum entropy framework. We
will:
• Demonstrate how to train the model
• Illustrate how the model can be used
Training an OpenNLP classification model
First, we have to train our model because OpenNLP does not have prebuilt models. This
process consists of creating a file of training data and then using the
DocumentCat-
egorizerME
model to perform the actual training. The model created is typically saved
in a file for later use.
The training file format consists of a series of lines where each line represents a document.
The first word of the line is the category. The category is followed by text separated by
whitespace. Here is an example for the
dog
category:
dog The most interesting feature of a dog is its ...
To demonstrate the training process, we created the
en-animals.train
file, where we
created two categories: cats and dogs. For the training text, we used sections of Wikipedia.
For dogs (
http://en.wikipedia.org/wiki/Dog
)
, we used the
As Pets
section. For cats (
ht-
tp://en.wikipedia.org/wiki/Cats_and_humans
)
, we used the
Pet
section plus the first para-
graph of the
Domesticated varieties
section. We also removed the numeric references from
the sections.
The first part of each line is shown here:
dog The most widespread form of interspecies bonding occurs
...
dog There have been two major trends in the changing status
of ...
dog There are a vast range of commodity forms available to
...
dog An Australian Cattle Dog in reindeer antlers sits on
Santa's lap ...