Stemming with LingPipe
The PorterStemmerTokenizerFactory class is used to find stems with
LingPipe. In this example, we use the same words array as in the previous section.
The IndoEuropeanTokenizerFactory class performs the initial tokenization, and the
Porter stemmer is then applied to each token it produces. The two factories are created here:
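import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.PorterStemmerTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;

TokenizerFactory tokenizerFactory =
    IndoEuropeanTokenizerFactory.INSTANCE;
TokenizerFactory porterFactory =
    new PorterStemmerTokenizerFactory(tokenizerFactory);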
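The wrapping idea used here, where one tokenizer produces tokens and a second one rewrites them, can be sketched without LingPipe. The following is only an illustration under stated assumptions: SimpleTokenizer, WHITESPACE, and modifying are hypothetical names invented for this sketch, not LingPipe classes, and lowercasing stands in for stemming.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class WrappedTokenizerSketch {

    // Hypothetical stand-in for a tokenizer factory: text in, tokens out.
    interface SimpleTokenizer {
        List<String> tokenize(String text);
    }

    // Base tokenizer: splits on whitespace only (an assumption for brevity;
    // it is far cruder than IndoEuropeanTokenizerFactory).
    static final SimpleTokenizer WHITESPACE = text -> {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    };

    // Wrapper: runs the base tokenizer, then rewrites every token it produced.
    static SimpleTokenizer modifying(SimpleTokenizer base,
                                     Function<String, String> modifier) {
        return text -> {
            List<String> out = new ArrayList<>();
            for (String token : base.tokenize(text)) {
                out.add(modifier.apply(token));
            }
            return out;
        };
    }

    public static void main(String[] args) {
        // Lowercasing is a placeholder transformation, not a stemmer.
        SimpleTokenizer lower = modifying(WHITESPACE, String::toLowerCase);
        System.out.println(lower.tokenize("Banks and Banking"));
        // prints [banks, and, banking]
    }
}
```

Because the wrapper only depends on the base tokenizer's output, the same base can be reused with a different per-token transformation, which is the role the Porter stemmer plays in the LingPipe code.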
An array to hold the stems is declared next. We reuse the words array declared in the
previous section. Each word is processed individually: the word is tokenized, and the
stemmed tokens returned by the tokens method are assigned to stems, as shown in the
following code block. Each word is then displayed together with its stem:
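import com.aliasi.tokenizer.Tokenization;

// Reassigned on every iteration, so no initial allocation is needed
String[] stems;
for (int i = 0; i < words.length; i++) {
    // Tokenize the word; porterFactory stems each token it produces
    Tokenization tokenizer =
        new Tokenization(words[i], porterFactory);
    stems = tokenizer.tokens();
    System.out.print("Word: " + words[i]);
    for (String stem : stems) {
        System.out.println(" Stem: " + stem);
    }
}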
When executed, we get the following output:
Word: bank Stem: bank
Word: banking Stem: bank
Word: banks Stem: bank
Word: banker Stem: banker
Word: banked Stem: bank
Word: bankart Stem: bankart
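The pattern in this output (banking, banks, and banked reduce to bank, while banker and bankart are left alone) can be illustrated with a few suffix-stripping rules. The sketch below is not LingPipe's implementation and not the full Porter algorithm; it is a deliberately simplified subset, assumed here only for illustration, that handles a plural -s and the -ed and -ing suffixes. SuffixStripSketch, hasVowel, and stem are hypothetical names. In the full algorithm, as far as we know, -er is removed only when the remaining stem is long enough by Porter's measure, which bank is not, so banker is unchanged.

```java
public class SuffixStripSketch {

    // True if the string contains at least one of the vowels a, e, i, o, u.
    static boolean hasVowel(String s) {
        for (char c : s.toCharArray()) {
            if ("aeiou".indexOf(c) >= 0) {
                return true;
            }
        }
        return false;
    }

    // Applies a small, simplified subset of Porter-style rules (assumption:
    // lowercase input; many real rules and conditions are omitted).
    static String stem(String word) {
        String w = word;
        // Plural: drop a final "s" unless the word ends in "ss".
        if (w.length() > 1 && w.endsWith("s") && !w.endsWith("ss")) {
            w = w.substring(0, w.length() - 1);
        }
        // "-ing" / "-ed": strip only when the remaining stem contains a vowel.
        for (String suffix : new String[] {"ing", "ed"}) {
            if (w.endsWith(suffix)) {
                String candidate = w.substring(0, w.length() - suffix.length());
                if (hasVowel(candidate)) {
                    w = candidate;
                }
                break;
            }
        }
        return w;
    }

    public static void main(String[] args) {
        String[] words = {"bank", "banking", "banks", "banker", "banked", "bankart"};
        for (String word : words) {
            System.out.println("Word: " + word + " Stem: " + stem(word));
        }
    }
}
```

For these six words the sketch yields the same stems as the listing above. The vowel check is what keeps a word such as sing from being cut down to s.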