Finding Parts of Text - Natural Language Processing with Java


Normalizing using a pipeline

In this section, we will combine several of the normalization techniques using a pipeline. To demonstrate this process, we will expand upon the example used in Using LingPipe to remove stopwords. We will add two additional factories to normalize text: LowerCaseTokenizerFactory and PorterStemmerTokenizerFactory.

The LowerCaseTokenizerFactory is added before the EnglishStopTokenizerFactory is created, and the PorterStemmerTokenizerFactory is added after it, as shown here:

import com.aliasi.tokenizer.*;

String paragraph = "A simple approach is to create a class "
    + "to hold and remove stopwords.";
// Each factory wraps the previous one, so tokens pass through
// the stages in the order in which the factories are created.
TokenizerFactory factory = IndoEuropeanTokenizerFactory.INSTANCE;
factory = new LowerCaseTokenizerFactory(factory);      // convert to lowercase
factory = new EnglishStopTokenizerFactory(factory);    // remove stopwords
factory = new PorterStemmerTokenizerFactory(factory);  // reduce to stems
Tokenizer tokenizer = factory.tokenizer(
    paragraph.toCharArray(), 0, paragraph.length());
for (String token : tokenizer) {
    System.out.println(token);
}

The output is as follows:

simpl
approach
creat
class
hold
remov
stopword
.

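What we have left are the stems of the words, in lowercase, with the stopwords removed. Notice that stems such as simpl and creat are not necessarily dictionary words, and that the period is still emitted as a token because it is not in the stop list.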
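The order of the stages in a pipeline can change the result. The following is a minimal sketch in plain Java, not LingPipe code, that illustrates why lowercasing is placed ahead of stopword removal. It assumes a case-sensitive lookup against a lowercase stop list. The class name PipelineOrderSketch, the four-word stop list, and the whitespace tokenization are hypothetical simplifications introduced only for this illustration.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PipelineOrderSketch {
    // Hypothetical, abbreviated stop list (all lowercase); not LingPipe's list.
    public static final Set<String> STOP_WORDS = Set.of("a", "is", "to", "and");

    // Stage that converts every token to lowercase.
    public static final Function<Stream<String>, Stream<String>> LOWER_CASE =
        tokens -> tokens.map(t -> t.toLowerCase(Locale.ROOT));

    // Stage that drops tokens found in the stop list (case-sensitive lookup).
    public static final Function<Stream<String>, Stream<String>> REMOVE_STOPS =
        tokens -> tokens.filter(t -> !STOP_WORDS.contains(t));

    // Splits the text on whitespace and runs the tokens through the pipeline.
    public static List<String> run(String text,
            Function<Stream<String>, Stream<String>> pipeline) {
        Stream<String> tokens = Stream.of(text.split("\\s+"));
        return pipeline.apply(tokens).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String text = "A simple approach is to create a class";

        // Lowercase first: the capitalized "A" becomes "a" and is removed.
        System.out.println(run(text, LOWER_CASE.andThen(REMOVE_STOPS)));
        // prints [simple, approach, create, class]

        // Stopword removal first: "A" does not match "a" and survives.
        System.out.println(run(text, REMOVE_STOPS.andThen(LOWER_CASE)));
        // prints [a, simple, approach, create, class]
    }
}
```

In this sketch, andThen plays the role that wrapping one factory in another plays in the LingPipe example: the first stage's output becomes the second stage's input.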
