Classifying Texts and Documents - Natural Language Processing with Java

Using OpenNLP

The DocumentCategorizer interface specifies methods to support the classification

process. The interface is implemented by the DocumentCategorizerME class. This

class will classify text into predefined categories using a maximum entropy framework. We

will:

• Demonstrate how to train the model

• Illustrate how the model can be used

Training an OpenNLP classification model

First, we have to train our model because OpenNLP does not provide prebuilt models for document classification. This process consists of creating a file of training data and then using the DocumentCategorizerME class to perform the actual training. The model created is typically saved in a file for later use.

The training file format consists of a series of lines where each line represents a document.

The first word of the line is the category. The category is followed by text separated by

whitespace. Here is an example for the dog category:

dog The most interesting feature of a dog is its ...
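The line layout described above can be illustrated with a small, standard-library-only sketch. It is not part of OpenNLP; the class name TrainingLineParser and its nested Sample type are hypothetical helpers introduced here only to show how a line splits into a category and its text.

```java
/** Illustrative helper (not part of OpenNLP) for the one-document-per-line format. */
final class TrainingLineParser {

    /** A parsed training line: the category label and the document text. */
    static final class Sample {
        final String category;
        final String text;

        Sample(String category, String text) {
            this.category = category;
            this.text = text;
        }
    }

    /** Splits a line at the first run of whitespace: label first, text after. */
    static Sample parse(String line) {
        String[] parts = line.trim().split("\\s+", 2);
        if (parts.length < 2 || parts[0].isEmpty()) {
            // A line needs both a category and some text to be usable for training.
            throw new IllegalArgumentException("Expected '<category> <text>': " + line);
        }
        return new Sample(parts[0], parts[1]);
    }

    public static void main(String[] args) {
        Sample s = parse("dog The most interesting feature of a dog is its ...");
        System.out.println(s.category + " -> " + s.text);
    }
}
```

A check like this can be run over a training file before training to catch lines that are missing a label or text.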

To demonstrate the training process, we created the en-animals.train file with two categories: cats and dogs. For the training text, we used sections of Wikipedia. For dogs (http://en.wikipedia.org/wiki/Dog), we used the As Pets section. For cats (http://en.wikipedia.org/wiki/Cats_and_humans), we used the Pet section plus the first paragraph of the Domesticated varieties section. We also removed the numeric references from the sections.
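The cleanup step can be sketched as follows. This is a minimal illustration, not the procedure actually used to build en-animals.train: it assumes the numeric references appear as bracketed numbers such as [12], and the class name TrainingLineBuilder and the sample sentence in main are hypothetical.

```java
import java.util.regex.Pattern;

/** Illustrative helper (not part of OpenNLP) that turns a paragraph into one training line. */
final class TrainingLineBuilder {

    // Assumption: numeric references look like [1] or [23] in the copied text.
    private static final Pattern NUMERIC_REF = Pattern.compile("\\[\\d+\\]");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Removes bracketed numeric references, collapses whitespace, and prefixes the category. */
    static String toTrainingLine(String category, String paragraph) {
        String withoutRefs = NUMERIC_REF.matcher(paragraph).replaceAll("");
        // Each document must sit on a single line, so line breaks become spaces.
        String singleLine = WHITESPACE.matcher(withoutRefs).replaceAll(" ").trim();
        return category + " " + singleLine;
    }

    public static void main(String[] args) {
        // Made-up sample text, not taken from the training file.
        String raw = "Dogs are common pets.[1]\nThey need regular exercise.[23]";
        System.out.println(toTrainingLine("dog", raw));
    }
}
```

Collapsing line breaks matters because the format treats every line as a separate document.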

The first part of each line is shown here:

dog The most widespread form of interspecies bonding occurs ...

dog There have been two major trends in the changing status of ...

dog There are a vast range of commodity forms available to ...

dog An Australian Cattle Dog in reindeer antlers sits on Santa's lap ...
