Classifying Texts and Documents - Natural Language Processing with Java


Using LingPipe to classify text

We will use LingPipe to demonstrate a number of classification tasks, including general text classification using trained models, sentiment analysis, and language identification. We will cover the following classification topics:

• Training text using the Classified class

• Training models using other training categories

• How to classify text using LingPipe

• Performing sentiment analysis using LingPipe

• Identifying the language used

Several of the tasks described in this section will use the following declarations. LingPipe comes with training data for several categories. The categories array contains the names of the categories packaged with LingPipe:

String[] categories = {"soc.religion.christian",
    "talk.religion.misc", "alt.atheism", "misc.forsale"};

The DynamicLMClassifier class performs the actual classification. It is created from the categories array, which gives it the names of the categories to use. The nGramSize value specifies the number of contiguous items in a sequence used in the model for classification purposes:

int nGramSize = 6;
DynamicLMClassifier<NGramProcessLM> classifier =
    DynamicLMClassifier.createNGramProcess(
        categories, nGramSize);
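To make the idea of "contiguous items in a sequence" concrete, here is a minimal sketch in plain Java that lists the character n-grams of a string. It does not use LingPipe and is not how DynamicLMClassifier is implemented; the class and method names (NGramSketch, charNGrams) are hypothetical. It assumes the items are characters, which matches the character-level NGramProcessLM model used above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hypothetical helper, not part of LingPipe.
class NGramSketch {

    // Returns every run of n contiguous characters in text, left to right.
    static List<String> charNGrams(String text, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Use the same size as nGramSize in the declarations above.
        for (String gram : charNGrams("for sale", 6)) {
            System.out.println("[" + gram + "]");
        }
    }
}
```

A larger n captures longer stretches of context per item but yields fewer, sparser n-grams from the same text; a string shorter than n yields none at all.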

Training text using the Classified class

General text classification using LingPipe involves training the DynamicLMClassifier class using training files and then using the class to perform the actual classification. LingPipe comes with several training datasets, found in the LingPipe directory demos/data/fourNewsGroups/4news-train. We will use these to illustrate the training process. This example is a simplified version of the process found at http://alias-i.com/ling-
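The train-then-classify pattern described here can be illustrated with a toy sketch in plain Java. This is not LingPipe's model or API: ToyNGramClassifier, its methods, and the training strings are all hypothetical, and the scoring (add-one smoothed character n-gram counts) is a simplification chosen only to show the two phases.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a toy character n-gram classifier, not LingPipe's model.
class ToyNGramClassifier {
    private final int n;
    // category -> (n-gram -> count seen in training text)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    // category -> total number of n-grams seen in training text
    private final Map<String, Integer> totals = new HashMap<>();

    ToyNGramClassifier(String[] categories, int n) {
        this.n = n;
        for (String category : categories) {
            counts.put(category, new HashMap<>());
            totals.put(category, 0);
        }
    }

    // Training step: add the n-grams of one labeled text to its category.
    // The category must be one of those passed to the constructor.
    void train(String category, String text) {
        Map<String, Integer> grams = counts.get(category);
        for (int i = 0; i + n <= text.length(); i++) {
            grams.merge(text.substring(i, i + n), 1, Integer::sum);
            totals.merge(category, 1, Integer::sum);
        }
    }

    // Classification step: pick the category with the highest smoothed log score.
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : counts.keySet()) {
            Map<String, Integer> grams = counts.get(category);
            // Add-one smoothing so unseen n-grams do not zero out the score.
            double denominator = totals.get(category) + grams.size() + 1.0;
            double score = 0.0;
            for (int i = 0; i + n <= text.length(); i++) {
                int count = grams.getOrDefault(text.substring(i, i + n), 0);
                score += Math.log((count + 1.0) / denominator);
            }
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        ToyNGramClassifier classifier = new ToyNGramClassifier(
            new String[] {"misc.forsale", "alt.atheism"}, 3);
        // Made-up training text standing in for the newsgroup files.
        classifier.train("misc.forsale", "for sale cheap bike price");
        classifier.train("alt.atheism", "god belief atheism debate");
        System.out.println(classifier.classify("bike for sale"));
    }
}
```

The point of the sketch is the shape of the workflow: each category accumulates statistics from its own training text, and a new text is assigned to whichever category's statistics fit it best.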

We start by declaring the training directory:
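Such a declaration might look like the following sketch. The installation root (lingpipeRoot) is a placeholder for wherever LingPipe is unpacked, and the helper missingCategories is hypothetical, not part of LingPipe. It assumes the training directory holds one subdirectory per category, named after the entries in the categories array.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical class and install path, not taken from LingPipe.
class TrainingDirSketch {

    // Lists the categories that have no subdirectory under trainingDir.
    static List<String> missingCategories(File trainingDir, String[] categories) {
        List<String> missing = new ArrayList<>();
        for (String category : categories) {
            if (!new File(trainingDir, category).isDirectory()) {
                missing.add(category);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String[] categories = {"soc.religion.christian",
            "talk.religion.misc", "alt.atheism", "misc.forsale"};
        // Assumption: the LingPipe install directory is passed as the first argument.
        String lingpipeRoot = args.length > 0 ? args[0] : ".";
        File trainingDirectory =
            new File(lingpipeRoot, "demos/data/fourNewsGroups/4news-train");
        System.out.println("Missing categories: "
            + missingCategories(trainingDirectory, categories));
    }
}
```

Checking the directory layout up front gives a clearer failure than discovering a missing category folder partway through training.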
