Using LingPipe to remove stopwords
LingPipe provides the EnglishStopTokenizerFactory class, which we will use to identify and remove stopwords. The words in this list are found in http://alias-i.com/ling-... and include words such as a, was, but, he, and for.
The factory class's constructor requires a TokenizerFactory instance as its argument. We will use the factory's tokenizer method to process a list of words and remove the stopwords. We start by declaring the string to be tokenized:
String paragraph = "A simple approach is to create a class "
    + "to hold and remove stopwords.";
Next, we create an instance of a TokenizerFactory based on the IndoEuropeanTokenizerFactory class. We then use that factory as the argument to create our EnglishStopTokenizerFactory instance:
TokenizerFactory factory = IndoEuropeanTokenizerFactory.INSTANCE;
factory = new EnglishStopTokenizerFactory(factory);
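The wrapping step above, where one factory is passed to the constructor of another, can be pictured with a small plain-Java sketch. This is not LingPipe code: the class StopFilterSketch, the helpers splitOnNonWords and withStopFilter, and the four-word stop list are all hypothetical names chosen for illustration, and the base tokenizer is a simple regular-expression split rather than the Indo-European tokenizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

class StopFilterSketch {
    // Hypothetical stop list for illustration only; not LingPipe's list.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "is", "to", "and"));

    // Base tokenizer: split on runs of non-word characters.
    static List<String> splitOnNonWords(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("\\W+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Wraps any tokenizer and drops tokens found in the stop set.
    // Matching is case-sensitive, so "A" is not treated as "a".
    static Function<String, List<String>> withStopFilter(
            Function<String, List<String>> inner, Set<String> stopWords) {
        return text -> {
            List<String> kept = new ArrayList<>();
            for (String token : inner.apply(text)) {
                if (!stopWords.contains(token)) {
                    kept.add(token);
                }
            }
            return kept;
        };
    }

    public static void main(String[] args) {
        String paragraph = "A simple approach is to create a class "
            + "to hold and remove stopwords.";
        // Start with the base tokenizer, then wrap it with the filter.
        Function<String, List<String>> tokenizer = StopFilterSketch::splitOnNonWords;
        tokenizer = withStopFilter(tokenizer, STOP_WORDS);
        for (String token : tokenizer.apply(paragraph)) {
            System.out.println(token);
        }
    }
}
```

Because the filter only depends on the inner tokenizer's output, the same wrapper can sit on top of any base tokenizer. In this sketch the capitalized A survives because the comparison is case-sensitive; lowercasing tokens before filtering would remove it.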
We then use the LingPipe Tokenizer class and the factory's tokenizer method to process the text declared in the paragraph variable. The tokenizer method takes an array of char, a starting index, and a length:
Tokenizer tokenizer =
    factory.tokenizer(paragraph.toCharArray(), 0, paragraph.length());
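The (array, start, length) argument convention used above also appears in the standard String(char[], int, int) constructor, and the following plain-Java sketch shows how it selects either the whole text or a sub-span. The class SliceTokenizeSketch and its tokenizeSlice helper are hypothetical and only split on whitespace; they stand in for the idea and are not part of LingPipe.

```java
import java.util.ArrayList;
import java.util.List;

class SliceTokenizeSketch {
    // Hypothetical helper: tokenizes only chars[start .. start + length).
    // The third argument is a length, not an end index.
    static List<String> tokenizeSlice(char[] chars, int start, int length) {
        String slice = new String(chars, start, length);
        List<String> tokens = new ArrayList<>();
        for (String piece : slice.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String paragraph = "A simple approach is to create a class";
        char[] chars = paragraph.toCharArray();
        // Whole text: start at 0 and cover every character.
        System.out.println(tokenizeSlice(chars, 0, chars.length));
        // Sub-span: 15 characters starting at index 2.
        System.out.println(tokenizeSlice(chars, 2, 15));
    }
}
```

Passing 0 and the full length, as the text does, processes the entire paragraph; other values restrict processing to part of the array without copying the string first.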
The following for-each statement will iterate over the revised list:
for (String token : tokenizer) {
    System.out.println(token);
}
The output will be as follows: