Removing stopwords
There are several approaches to removing stopwords. A simple approach is to create a class
to hold and remove stopwords. Several NLP APIs also provide support for stopword removal.
We will create a simple class called StopWords to demonstrate the first approach.
We will then use LingPipe's EnglishStopTokenizerFactory class to demonstrate
the second approach.
Creating a StopWords class
The process of removing stopwords involves examining a stream of tokens, comparing
them to a list of stopwords, and then removing the stopwords from the stream. To illustrate
this approach, we will create a simple class that supports basic operations as defined in the
following table:
Constructor/Method           Usage
Default constructor          Uses a default set of stopwords
Single argument constructor  Uses stopwords stored in a file
addStopWord                  Adds a new stopword to the internal list
removeStopWords              Accepts an array of words and returns a new array with the stopwords removed
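Before building the full class, the core filtering step described above (compare each token to a set of stopwords and keep only the rest) can be sketched in a few lines. This is only an illustrative sketch, not the StopWords class developed in this section: the class name StopWordFilterSketch, its shortened default list, and the choice to compare words in lowercase are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the table's core operations; not the book's StopWords class.
class StopWordFilterSketch {
    // Shortened, illustrative default list (an assumption for this example).
    private static final String[] DEFAULT_STOP_WORDS =
            {"a", "an", "in", "is", "of", "the", "to"};

    // A HashSet gives constant-time membership checks on average.
    private final Set<String> stopWords =
            new HashSet<>(Arrays.asList(DEFAULT_STOP_WORDS));

    // Adds a new stopword; stored in lowercase (assumed case-insensitive matching).
    void addStopWord(String word) {
        stopWords.add(word.toLowerCase(Locale.ROOT));
    }

    // Returns a new array containing only the words that are not stopwords.
    String[] removeStopWords(String[] words) {
        List<String> kept = new ArrayList<>();
        for (String word : words) {
            if (!stopWords.contains(word.toLowerCase(Locale.ROOT))) {
                kept.add(word);
            }
        }
        return kept.toArray(new String[0]);
    }
}
```

The input array is left unchanged and a new array is returned, which matches the removeStopWords description in the table. Lowercasing before the lookup is a design choice of this sketch, so that "The" and "the" are treated alike.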
Create a class called StopWords that declares two instance variables, as shown in the
following code block. The variable defaultStopWords is an array that holds the default
stopword list. The second variable, a HashSet, holds the stopword list used
for processing purposes:
public class StopWords {
    private String[] defaultStopWords = {"i", "a", "about",
        "an", "are", "as", "at", "be", "by", "com", "for", "from",
        "how", "in", "is", "it", "of", "on", "or", "that", "the",
        "this", "to", "was", "what", "when", "where", "who", "will",
        "with"};