Normalizing using a pipeline
In this section, we will combine several of the normalization techniques using a pipeline. To demonstrate this process, we will expand upon the example used in Using LingPipe to remove stopwords. We will add two additional factories to normalize the text: LowerCaseTokenizerFactory and PorterStemmerTokenizerFactory.
The LowerCaseTokenizerFactory is added before the creation of the EnglishStopTokenizerFactory, and the PorterStemmerTokenizerFactory is added after it. The order matters: the stop list holds lowercase entries and is matched case-sensitively, so a capitalized stopword such as "A" would survive if the tokens were not lowercased first. Each factory wraps the previous one, as shown here:
paragraph = "A simple approach is to create a class "
    + "to hold and remove stopwords.";
TokenizerFactory factory = IndoEuropeanTokenizerFactory.INSTANCE;
factory = new LowerCaseTokenizerFactory(factory);
factory = new EnglishStopTokenizerFactory(factory);
factory = new PorterStemmerTokenizerFactory(factory);
Tokenizer tokenizer =
    factory.tokenizer(paragraph.toCharArray(), 0, paragraph.length());
for (String token : tokenizer) {
    System.out.println(token);
}
The output is as follows:
simpl
approach
creat
class
hold
remov
stopword
.
What we have left are the stems of the words in lowercase with the stopwords removed.
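To make the chaining idea concrete without depending on LingPipe, the following is a minimal sketch of the same pattern in plain Java. It is not LingPipe's implementation: the class name NormalizationPipelineSketch, its methods, the regular-expression tokenizer, and the four-word stop list are all hypothetical, and the stemming stage is left out. Each stage is appended by reassigning the pipeline variable, mirroring the way factory is rewrapped above, and the second pipeline shows what happens when stopwords are filtered before lowercasing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class NormalizationPipelineSketch {
    // Hypothetical, deliberately tiny stop list; every entry is lowercase.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "is", "to", "and"));

    // Assumed tokenization rule: runs of word characters, or single punctuation marks.
    static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    static List<String> lowerCase(List<String> tokens) {
        return tokens.stream()
            .map(t -> t.toLowerCase(Locale.ROOT))
            .collect(Collectors.toList());
    }

    // Case-sensitive lookup, so "A" is kept unless it was lowercased earlier.
    static List<String> removeStopWords(List<String> tokens) {
        return tokens.stream()
            .filter(t -> !STOP_WORDS.contains(t))
            .collect(Collectors.toList());
    }

    // Intended order: tokenize, lowercase, then filter stopwords.
    static Function<String, List<String>> lowerThenStop() {
        Function<String, List<String>> pipeline = NormalizationPipelineSketch::tokenize;
        pipeline = pipeline.andThen(NormalizationPipelineSketch::lowerCase);
        pipeline = pipeline.andThen(NormalizationPipelineSketch::removeStopWords);
        return pipeline;
    }

    // Reversed order, included only to show why the order matters.
    static Function<String, List<String>> stopThenLower() {
        Function<String, List<String>> pipeline = NormalizationPipelineSketch::tokenize;
        pipeline = pipeline.andThen(NormalizationPipelineSketch::removeStopWords);
        pipeline = pipeline.andThen(NormalizationPipelineSketch::lowerCase);
        return pipeline;
    }

    public static void main(String[] args) {
        String paragraph = "A simple approach is to create a class "
            + "to hold and remove stopwords.";
        // prints [simple, approach, create, class, hold, remove, stopwords, .]
        System.out.println(lowerThenStop().apply(paragraph));
        // prints [a, simple, approach, create, class, hold, remove, stopwords, .]
        System.out.println(stopThenLower().apply(paragraph));
    }
}
```

In the reversed pipeline the leading "A" passes the stop filter because the lookup is case-sensitive, and it is only lowercased afterwards, so it remains in the output.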