Using LingPipe to remove stopwords
LingPipe provides the EnglishStopTokenizerFactory class, which we will use to
identify and remove stopwords. Its stopword list is documented at
http://alias-i.com/lingpipe/docs/api/com/aliasi/tokenizer/EnglishStopTokenizerFactory.html .
It includes words such as a, was, but, he, and for.
The factory class's constructor requires a TokenizerFactory instance as its
argument. We will use the factory's tokenizer method to tokenize the text and
remove the stopwords. We start by declaring the string to be tokenized:
String paragraph = "A simple approach is to create a class "
    + "to hold and remove stopwords.";
Next, we create an instance of a TokenizerFactory based on the
IndoEuropeanTokenizerFactory class. We then use that factory as the argument to
create our EnglishStopTokenizerFactory instance:
TokenizerFactory factory =
IndoEuropeanTokenizerFactory.INSTANCE;
factory = new EnglishStopTokenizerFactory(factory);
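The wrapping step can be pictured as one factory decorating another. The following is a minimal plain-Java sketch of that idea, not LingPipe's implementation: the SimpleTokenizerFactory interface, the BaseFactory and StopFactory classes, and the four-word stop set are all hypothetical names and values chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopFilterSketch {

    // Hypothetical stand-in for a tokenizer factory: turns a slice of a
    // char array (start index plus length) into a list of tokens.
    public interface SimpleTokenizerFactory {
        List<String> tokenize(char[] chars, int start, int length);
    }

    // Base factory: emits runs of letters/digits as words and each
    // non-whitespace punctuation character as its own token.
    public static class BaseFactory implements SimpleTokenizerFactory {
        @Override
        public List<String> tokenize(char[] chars, int start, int length) {
            List<String> tokens = new ArrayList<>();
            StringBuilder word = new StringBuilder();
            for (int i = start; i < start + length; i++) {
                char c = chars[i];
                if (Character.isLetterOrDigit(c)) {
                    word.append(c);
                } else {
                    if (word.length() > 0) {
                        tokens.add(word.toString());
                        word.setLength(0);
                    }
                    if (!Character.isWhitespace(c)) {
                        tokens.add(String.valueOf(c));
                    }
                }
            }
            if (word.length() > 0) {
                tokens.add(word.toString());
            }
            return tokens;
        }
    }

    // Wrapping factory: delegates to the base factory, then drops every
    // token that exactly matches an entry in the stop set.
    public static class StopFactory implements SimpleTokenizerFactory {
        private final SimpleTokenizerFactory base;
        private final Set<String> stopWords;

        public StopFactory(SimpleTokenizerFactory base, Set<String> stopWords) {
            this.base = base;
            this.stopWords = stopWords;
        }

        @Override
        public List<String> tokenize(char[] chars, int start, int length) {
            List<String> kept = new ArrayList<>();
            for (String token : base.tokenize(chars, start, length)) {
                if (!stopWords.contains(token)) {
                    kept.add(token);
                }
            }
            return kept;
        }
    }

    public static List<String> demo() {
        String paragraph = "A simple approach is to create a class "
            + "to hold and remove stopwords.";
        // Assumed stop set for illustration only; far smaller than a real list.
        Set<String> stops = new HashSet<>(Arrays.asList("a", "is", "to", "and"));
        SimpleTokenizerFactory factory = new BaseFactory();
        factory = new StopFactory(factory, stops); // same wrapping shape as in the text
        return factory.tokenize(paragraph.toCharArray(), 0, paragraph.length());
    }

    public static void main(String[] args) {
        for (String token : demo()) {
            System.out.println(token);
        }
    }
}
```

In this sketch the comparison is an exact string match, so the capitalized A passes through while the lowercase a is dropped. Whether a real stop filter behaves the same way depends on how its list is matched.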
We then process the text declared in the paragraph variable with the factory's
tokenizer method, which returns a LingPipe Tokenizer. The tokenizer method takes
an array of char, a starting index, and the number of characters to process:
Tokenizer tokenizer =
factory.tokenizer(paragraph.toCharArray(), 0,
paragraph.length());
The following for-each statement will iterate over the revised list:
for (String token : tokenizer) {
System.out.println(token);
}
The output will be as follows:
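Printing is only one way to consume the tokens. If they are needed after the loop, they can be copied into a list first. The helper below is a small plain-Java sketch that accepts any Iterable<String>; the TokenCollector class and its collect method are hypothetical names, and the sample tokens in main are placeholder values rather than real tokenizer output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenCollector {

    // Copies every token from the given Iterable into a new list,
    // preserving iteration order.
    public static List<String> collect(Iterable<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            result.add(token);
        }
        return result;
    }

    public static void main(String[] args) {
        // Placeholder tokens standing in for a tokenizer's output.
        Iterable<String> filtered = Arrays.asList("simple", "approach", "create");
        List<String> tokens = collect(filtered);
        System.out.println(tokens.size() + " tokens: " + tokens);
    }
}
```

Because the helper depends only on Iterable<String>, it should work with anything that can appear on the right-hand side of the for-each statement shown earlier.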