}
String sentences[] = detector.sentDetect(sb.toString());
For the modified version of the topic file, this method created an array of 14,859 sentences.
Next, we used the toLowerCase method to convert the text to lowercase. This ensures that the stop word removal step catches every stop word, regardless of how it is capitalized in the original text.
for (int i = 0; i < sentences.length; i++) {
    sentences[i] = sentences[i].toLowerCase();
}
Converting to lowercase and removing stop words restricts searches: a query can no longer distinguish case or match a stop word. However, this is considered to be a feature of this implementation and can be adjusted for other implementations.
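To illustrate why the lowercase step matters, here is a minimal sketch of a case-sensitive lookup. The class name, the method names, and the three-word stop list are hypothetical and are not part of the example's StopWords class; a plain HashSet stands in for it.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class LowercaseLookupSketch {
    // Hypothetical stop list, for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "of"));

    // Case-sensitive lookup: "The" is not found because only "the" is stored.
    static boolean isStopWord(String token) {
        return STOP_WORDS.contains(token);
    }

    // Lowercasing first lets the same lookup catch capitalized variants.
    static boolean isStopWordAfterLowercase(String token) {
        return STOP_WORDS.contains(token.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(isStopWord("The"));
        System.out.println(isStopWordAfterLowercase("The"));
    }
}
```

Passing Locale.ROOT is a choice made for this sketch so that the result does not depend on the default locale of the machine; the example in the text calls toLowerCase without arguments.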
Next, the stop words are removed. As mentioned earlier, an overloaded version of the removeStopWords method has been added to make it easier to use with this example. The new method is shown here:
public String removeStopWords(String words) {
    String[] arr = WhitespaceTokenizer.INSTANCE.tokenize(words);
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < arr.length; i++) {
        // Keep only the tokens that are not stop words
        if (!stopWords.contains(arr[i])) {
            sb.append(arr[i]).append(" ");
        }
    }
    return sb.toString();
}
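The same filtering idea can be tried without the OpenNLP library. The following is a rough sketch under stated assumptions: it splits on whitespace with String.split instead of using WhitespaceTokenizer, and it uses a small hypothetical stop list held in a HashSet instead of the stopWords field. The class name is also hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class StopWordRemovalSketch {
    // Hypothetical stop list; the example in the text loads its list from a file.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "is", "on"));

    // Splits the input on runs of whitespace and keeps every token that is
    // not in the stop list. Each kept token is followed by a single space,
    // so a non-empty result ends with a trailing space.
    static String removeStopWords(String words) {
        StringBuilder sb = new StringBuilder();
        for (String token : words.trim().split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                sb.append(token).append(' ');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints "cat mat " followed by a closing bracket to make the
        // trailing space visible.
        System.out.println(removeStopWords("the cat is on the mat") + "]");
    }
}
```

A caller that does not want the trailing space can call trim() on the result.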
We created a StopWords instance using the stop-words_english_2_en.txt file, as shown in the following code sequence. This is one of several lists that can be downloaded from https://code.google.com/p/stop-words/. We chose this file simply because it contains stop words that we felt were appropriate for the topic.