Using LingPipe to classify text
We will use LingPipe to demonstrate a number of classification tasks including general text
classification using trained models, sentiment analysis, and language identification. We will
cover the following classification topics:
• Training text using the Classified class
• Training models using other training categories
• How to classify text using LingPipe
• Performing sentiment analysis using LingPipe
• Identifying the language used
Several of the tasks described in this section will use the following declarations. LingPipe
comes with training data for several categories. The categories array contains the
names of the categories packaged with LingPipe:
String[] categories = {"soc.religion.christian",
    "talk.religion.misc", "alt.atheism", "misc.forsale"};
The DynamicLMClassifier class performs the actual classification. It is created using
the categories array, which gives it the names of the categories to use. The nGramSize
value specifies the number of contiguous items in a sequence (characters, for the n-gram
process model created here) used in the model for classification purposes:
int nGramSize = 6;
DynamicLMClassifier<NGramProcessLM> classifier =
    DynamicLMClassifier.createNGramProcess(
        categories, nGramSize);
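To make the meaning of nGramSize more concrete, the following is a minimal sketch that lists the character sequences of a given length found in a string. It uses only the standard Java library. The CharNGramSketch class and its charNGrams method are hypothetical names introduced for illustration and are not part of LingPipe. The actual model also estimates probabilities from such sequences, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper, not part of LingPipe: lists the character n-grams of a string.
class CharNGramSketch {

    // Returns every run of n contiguous characters in text, from left to right.
    static List<String> charNGrams(String text, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> grams = new ArrayList<>();
        // A window of width n slides along the text one character at a time.
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }
}
```

With a size of 6, calling CharNGramSketch.charNGrams("for sale", 6) yields the three sequences "for sa", "or sal", and "r sale". A string shorter than the requested size yields an empty list.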
Training text using the Classified class
General text classification using LingPipe involves training the DynamicLMClassifier
class using training files and then using the class to perform the actual classification.
LingPipe comes with several training datasets, found in the LingPipe directory
demos/data/fourNewsGroups/4news-train. We will use these to illustrate the training
process. This example is a simplified version of the process found at
http://alias-i.com/ling-
We start by declaring the training directory: