Normalizing using a pipeline
In this section, we will combine several of the normalization techniques using a pipeline. To
demonstrate this process, we will expand upon the example used in Using LingPipe to
remove stopwords. We will add two additional factories to normalize text:
LowerCaseTokenizerFactory and PorterStemmerTokenizerFactory.
The LowerCaseTokenizerFactory is applied before the EnglishStopTokenizerFactory
is created, and the PorterStemmerTokenizerFactory is applied after it, as shown here:
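// All of these classes are in LingPipe's com.aliasi.tokenizer package.
String paragraph = "A simple approach is to create a class "
    + "to hold and remove stopwords.";
// Start with the base tokenizer, then wrap it with each filter in turn.
TokenizerFactory factory =
    IndoEuropeanTokenizerFactory.INSTANCE;
// Lowercase first so that stopwords match regardless of case.
factory = new LowerCaseTokenizerFactory(factory);
factory = new EnglishStopTokenizerFactory(factory);
// Stem last, after the stopwords have been removed.
factory = new PorterStemmerTokenizerFactory(factory);
Tokenizer tokenizer =
    factory.tokenizer(paragraph.toCharArray(), 0,
        paragraph.length());
for (String token : tokenizer) {
    System.out.println(token);
}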
The output is as follows:
simpl
approach
creat
class
hold
remov
stopword
.
What remains are the lowercase stems of the words, with the stopwords removed. The final
period is kept because the tokenizer returns it as a token of its own.
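The order of the stages matters: the stop filter can only match a word once it has been lowercased. The following is a minimal sketch in plain Java, without LingPipe, that illustrates this. The class PipelineSketch, its regex tokenizer, and its four-word stop list are hypothetical stand-ins chosen for this example, not LingPipe's implementation, and the stemming stage is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class PipelineSketch {
    // Tiny illustrative stop list (an assumption; not LingPipe's list).
    static final Set<String> STOP_WORDS = Set.of("a", "is", "to", "and");

    // A token is a run of letters or a single non-letter, non-space character.
    static final Pattern TOKEN = Pattern.compile("[A-Za-z]+|[^A-Za-z\\s]");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Stage 1: convert every token to lowercase.
    static final Function<List<String>, List<String>> LOWER_CASE =
        tokens -> tokens.stream()
            .map(t -> t.toLowerCase(Locale.ENGLISH))
            .collect(Collectors.toList());

    // Stage 2: drop tokens that appear (case-sensitively) in the stop list.
    static final Function<List<String>, List<String>> REMOVE_STOP_WORDS =
        tokens -> tokens.stream()
            .filter(t -> !STOP_WORDS.contains(t))
            .collect(Collectors.toList());

    // Lowercase first, then filter: the order used in the text.
    static List<String> normalize(String text) {
        return LOWER_CASE.andThen(REMOVE_STOP_WORDS).apply(tokenize(text));
    }

    // Filter first, then lowercase: the leading "A" escapes the stop list.
    static List<String> normalizeWrongOrder(String text) {
        return REMOVE_STOP_WORDS.andThen(LOWER_CASE).apply(tokenize(text));
    }

    public static void main(String[] args) {
        String paragraph = "A simple approach is to create a class "
            + "to hold and remove stopwords.";
        System.out.println(normalize(paragraph));
        System.out.println(normalizeWrongOrder(paragraph));
    }
}
```

With the stages reversed, the capitalized "A" at the start of the sentence passes through the stop filter and only then becomes "a", so it stays in the result.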