Finding Parts of Text - Natural Language Processing with Java

Using LingPipe to remove stopwords

LingPipe provides the EnglishStopTokenizerFactory class, which we will use to
identify and remove stopwords. The words in its stop list are found in
http://alias-i.com/ling- and include words such as a, was, but, he, and for.

The factory class's constructor requires a TokenizerFactory instance as its
argument. We will use the factory's tokenizer method to process the text and
remove the stopwords. We start by declaring the string to be tokenized:

String paragraph = "A simple approach is to create a class "
    + "to hold and remove stopwords.";

Next, we create an instance of a TokenizerFactory based on the
IndoEuropeanTokenizerFactory class. We then use that factory as the argument to
create our EnglishStopTokenizerFactory instance:

TokenizerFactory factory = IndoEuropeanTokenizerFactory.INSTANCE;
factory = new EnglishStopTokenizerFactory(factory);
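
The wrapping step above, where one factory is handed to another that filters its output, can be sketched in plain Java without LingPipe. This is only an illustration of the idea, not LingPipe's implementation: the class StopFilterSketch, the helpers splitWords and withStopFilter, the sample sentence, and the five-word stop list (taken from the examples mentioned earlier) are all assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

class StopFilterSketch {

    // Hypothetical base tokenizer: splits on runs of non-letter characters.
    static List<String> splitWords(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^A-Za-z]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Hypothetical wrapper: returns a new tokenizing function that runs the
    // base function and then drops every token found in the stop set.
    static Function<String, List<String>> withStopFilter(
            Function<String, List<String>> base, Set<String> stopwords) {
        return text -> {
            List<String> kept = new ArrayList<>();
            for (String token : base.apply(text)) {
                if (!stopwords.contains(token)) {
                    kept.add(token);
                }
            }
            return kept;
        };
    }

    public static void main(String[] args) {
        // Illustrative five-word list; this is not LingPipe's actual stop list.
        Set<String> stopwords = Set.of("a", "was", "but", "he", "for");

        // Start with the base tokenizer, then rebind the variable to the
        // wrapped version, mirroring the reassignment of factory in the text.
        Function<String, List<String>> tokenizer = StopFilterSketch::splitWords;
        tokenizer = withStopFilter(tokenizer, stopwords);

        // Prints [late, taxi, came, him]
        System.out.println(tokenizer.apply("he was late but a taxi came for him"));
    }
}
```

The sketch matches tokens exactly, so a capitalized word is treated as different from its lowercase form. Whether a given library behaves the same way should be checked against its documentation.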

Using the LingPipe Tokenizer class and the factory's tokenizer method, we
process the text declared in the paragraph variable. The tokenizer method takes
an array of char, a starting index, and the number of characters to process:

Tokenizer tokenizer =
    factory.tokenizer(paragraph.toCharArray(), 0, paragraph.length());
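
The (char array, start, length) argument shape used above is a common Java convention for selecting a range of characters. As a small sketch using only the standard String constructor with the same shape, the following shows how a start of 0 and a length of paragraph.length() cover the whole text, while other values select a sub-range. The class CharRangeSketch, the helper slice, and the shortened sample string are hypothetical names and values chosen for this illustration; they are not part of LingPipe.

```java
class CharRangeSketch {

    // Hypothetical helper with the same (char[], start, length) shape:
    // builds a String from length characters beginning at index start.
    static String slice(char[] chars, int start, int length) {
        return new String(chars, start, length);
    }

    public static void main(String[] args) {
        String paragraph = "A simple approach";
        char[] chars = paragraph.toCharArray();

        // Start 0 with the full length covers every character.
        System.out.println(slice(chars, 0, paragraph.length())); // A simple approach

        // Start 2 with length 6 selects only the second word.
        System.out.println(slice(chars, 2, 6)); // simple
    }
}
```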

The following for-each statement iterates over the tokens that remain:

for (String token : tokenizer) {
    System.out.println(token);
}

The output will be as follows:
