Finding Parts of Text - Natural Language Processing with Java

Stemming with LingPipe

LingPipe finds stems with the PorterStemmerTokenizerFactory class. In this example, we will use the same words array as in the previous section. The IndoEuropeanTokenizerFactory class performs the initial tokenization, and the Porter stemmer is then applied to each token. The two factories are created as follows:

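import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.PorterStemmerTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;

// Base tokenizer, wrapped by a factory that stems each token it produces
TokenizerFactory tokenizerFactory = IndoEuropeanTokenizerFactory.INSTANCE;
TokenizerFactory porterFactory = new PorterStemmerTokenizerFactory(tokenizerFactory);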
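The wrapping relationship between the two factories can be illustrated with a small, self-contained sketch that does not depend on LingPipe. Everything in it is a hypothetical stand-in: the WrapSketch class, its nested Factory interface, and the toy stripPluralS modifier are assumptions made for illustration, and the modifier is a naive plural stripper, not the Porter algorithm. It only shows the general idea of one factory delegating tokenization to another and then rewriting each token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

class WrapSketch {
    // Hypothetical stand-in for a tokenizer factory; not a LingPipe type.
    interface Factory {
        List<String> tokenize(String text);
    }

    // Base factory: splits the text on runs of whitespace.
    static class WhitespaceFactory implements Factory {
        public List<String> tokenize(String text) {
            List<String> tokens = new ArrayList<>();
            for (String t : text.trim().split("\\s+")) {
                if (!t.isEmpty()) {
                    tokens.add(t);
                }
            }
            return tokens;
        }
    }

    // Wrapping factory: delegates to a base factory, then rewrites each token.
    static class ModifyingFactory implements Factory {
        private final Factory base;
        private final UnaryOperator<String> modifier;

        ModifyingFactory(Factory base, UnaryOperator<String> modifier) {
            this.base = base;
            this.modifier = modifier;
        }

        public List<String> tokenize(String text) {
            List<String> out = new ArrayList<>();
            for (String t : base.tokenize(text)) {
                out.add(modifier.apply(t));
            }
            return out;
        }
    }

    // Toy modifier (assumption): drops one trailing "s". NOT a real stemmer.
    static String stripPluralS(String token) {
        if (token.length() > 3 && token.endsWith("s") && !token.endsWith("ss")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    // Builds the wrapped factory and tokenizes the given text.
    static List<String> demo(String text) {
        Factory base = new WhitespaceFactory();
        Factory wrapped = new ModifyingFactory(base, WrapSketch::stripPluralS);
        return wrapped.tokenize(text);
    }

    public static void main(String[] args) {
        System.out.println(demo("banks bank"));
    }
}
```

The design choice worth noting is that the wrapping factory knows nothing about how the text is split, so the same token-rewriting step can sit on top of any base tokenizer.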

Next, an array is declared to hold the stems. Each word in the words array from the previous section is processed individually: the word is tokenized through porterFactory, and the resulting stemmed tokens are stored in stems. Each word is then displayed together with its stem:

String[] stems = new String[words.length];
for (int i = 0; i < words.length; i++) {
    // Tokenize the word; porterFactory stems each token it produces
    Tokenization tokenization = new Tokenization(words[i], porterFactory);
    stems = tokenization.tokens();
    System.out.print("Word: " + words[i]);
    for (String stem : stems) {
        System.out.println(" Stem: " + stem);
    }
}

When executed, we get the following output:

Word: bank Stem: bank

Word: banking Stem: bank

Word: banks Stem: bank

Word: banker Stem: banker

Word: banked Stem: bank

Word: bankart Stem: bankart
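The output leaves "banker" unchanged even though "banking" and "banked" are reduced to "bank". One plausible explanation, assuming LingPipe follows the original Porter algorithm here, is that Porter's rules remove the "er" suffix only when the remaining stem has a measure greater than 1, where the measure counts vowel-to-consonant transitions in the stem. The sketch below computes that measure; the class and method names are hypothetical and it is not LingPipe code, only an illustration of the measure as Porter defines it.

```java
class PorterMeasureSketch {
    // Porter's definition: a, e, i, o, u are vowels; 'y' is a consonant
    // unless it is preceded by a consonant, in which case it is a vowel.
    static boolean isConsonant(String w, int i) {
        switch (w.charAt(i)) {
            case 'a': case 'e': case 'i': case 'o': case 'u':
                return false;
            case 'y':
                return i == 0 || !isConsonant(w, i - 1);
            default:
                return true;
        }
    }

    // Measure m: the number of vowel-run to consonant-run transitions,
    // that is, the m in the pattern [C](VC)^m[V].
    static int measure(String stem) {
        String w = stem.toLowerCase();
        int m = 0;
        boolean prevVowel = false;
        for (int i = 0; i < w.length(); i++) {
            boolean consonant = isConsonant(w, i);
            if (consonant && prevVowel) {
                m++;
            }
            prevVowel = !consonant;
        }
        return m;
    }

    public static void main(String[] args) {
        // "banker" minus "er" leaves "bank", whose measure is 1
        System.out.println("bank: " + measure("bank"));
        // "airliner" minus "er" leaves "airlin", whose measure is 2
        System.out.println("airlin: " + measure("airlin"));
    }
}
```

Because the measure of "bank" is 1, the condition for dropping "er" is not met and "banker" is returned as is. "bankart" is also unchanged, since "art" is not among the suffixes the algorithm strips.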
