}
String sentences[] = detector.sentDetect(sb.toString());
For the modified version of the topic file, this method created an array with 14,859 sentences.
Next, we used the toLowerCase method to convert the text to lowercase. This was
done to ensure that when stop words are removed, the method will catch all of them.
for (int i = 0; i < sentences.length; i++) {
    sentences[i] = sentences[i].toLowerCase();
}
Converting to lowercase and removing stop words restricts searches, since a query can no longer match on case or on the removed words. However, this is considered to be a feature of this implementation and can be adjusted for other implementations.
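To illustrate why the lowercase step matters, here is a minimal sketch. It assumes the stop-word list is stored in lowercase and is looked up with an exact, case-sensitive match. The class LowercaseDemo and its three-word list are hypothetical and are not part of the book's code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class LowercaseDemo {
    // Hypothetical lowercase stop-word list, for illustration only.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "a", "of"));

    // Exact, case-sensitive lookup, as a plain Set performs it.
    static boolean isStopWord(String token) {
        return STOP_WORDS.contains(token);
    }

    public static void main(String[] args) {
        String token = "The";
        // The capitalized token is missed by the lookup.
        System.out.println(isStopWord(token));                          // false
        // After lowercasing, the same token is caught.
        System.out.println(isStopWord(token.toLowerCase(Locale.ROOT))); // true
    }
}
```

Passing an explicit Locale is a design choice in this sketch. It keeps the result independent of the default locale of the machine.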
Next, the stop words are removed. As mentioned earlier, an overloaded version of the
removeStopWords method has been added to make it easier to use with this example.
The new method is shown here:
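public String removeStopWords(String words) {
    // Split the input on whitespace using the OpenNLP tokenizer
    String[] arr =
        WhitespaceTokenizer.INSTANCE.tokenize(words);
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < arr.length; i++) {
        // Keep only the tokens that are not stop words
        if (!stopWords.contains(arr[i])) {
            sb.append(arr[i]).append(' ');
        }
    }
    // Note: the result ends with a trailing space if any token was kept
    return sb.toString();
}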
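For readers who want to try the filtering step without the OpenNLP dependency, the following is a minimal standalone sketch of the same idea. It is not the book's StopWords class. The StopWordFilter name, the constructor, and the sample list are assumptions made for illustration, and the tokenizer is replaced by a plain split on whitespace. Unlike the method above, it joins the kept tokens without a trailing space.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringJoiner;

public class StopWordFilter {
    private final Set<String> stopWords;

    // Hypothetical constructor: takes the stop words directly as a set.
    public StopWordFilter(Set<String> stopWords) {
        this.stopWords = new HashSet<>(stopWords);
    }

    // Returns the input with stop words removed, tokens joined by one space.
    public String removeStopWords(String words) {
        StringJoiner kept = new StringJoiner(" ");
        for (String token : words.trim().split("\\s+")) {
            // Skip the empty token produced by blank input, and any stop word.
            if (!token.isEmpty() && !stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        Set<String> sample = new HashSet<>(Arrays.asList("the", "is", "a"));
        StopWordFilter filter = new StopWordFilter(sample);
        System.out.println(filter.removeStopWords("the cat is on a mat"));
        // prints "cat on mat"
    }
}
```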
We created a StopWords instance using the stop-words_english_2_en.txt file as shown in the following code sequence. This is one of several lists that can be downloaded from https://code.google.com/p/stop-words/ . We chose this file simply because it contains stop words that we felt were appropriate for the topic.