Combined Approaches - Natural Language Processing with Java


}

String sentences[] = detector.sentDetect(sb.toString());

For the modified version of the topic file, this method created an array with 14,859 sentences.

Next, we used the toLowerCase method to convert the text to lowercase. This ensures that stop word removal catches every occurrence of a stop word, regardless of how it was capitalized in the original text.

for (int i = 0; i < sentences.length; i++) {
    sentences[i] = sentences[i].toLowerCase();
}

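Converting to lowercase and removing stop words restricts searches: a query can no longer distinguish between upper- and lowercase forms, and it cannot match the stop words themselves. However, this is considered to be a feature of this implementation and can be adjusted for other implementations.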
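To illustrate why the lowercase step matters, here is a minimal sketch using only the JDK. The class name LowercaseDemo, the three-word stop word set, and the use of Locale.ROOT are assumptions made for this illustration; they are not part of the original example, which calls toLowerCase() without a locale.

```java
import java.util.Locale;
import java.util.Set;

final class LowercaseDemo {
    // Hypothetical stop word list, for illustration only.
    static final Set<String> STOP_WORDS = Set.of("the", "a", "of");

    // Case-sensitive lookup, as a plain Set.contains call would do.
    static boolean isStopWord(String token) {
        return STOP_WORDS.contains(token);
    }

    // Locale.ROOT (an assumption here) avoids locale-specific case mappings.
    static String normalize(String text) {
        return text.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Without normalization the capitalized token is missed.
        System.out.println(isStopWord("The"));
        // After normalization the same token is recognized.
        System.out.println(isStopWord(normalize("The")));
    }
}
```

The first lookup misses the capitalized token and the second finds it, which is the behavior the lowercase pass is meant to guarantee.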

Next, the stop words are removed. As mentioned earlier, an overloaded version of the

removeStopWords method has been added to make it easier to use with this example.

The new method is shown here:

public String removeStopWords(String words) {
    // Split the input on whitespace using OpenNLP's tokenizer
    String arr[] = WhitespaceTokenizer.INSTANCE.tokenize(words);
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < arr.length; i++) {
        // Keep only the tokens that are not in the stop word list
        if (!stopWords.contains(arr[i])) {
            sb.append(arr[i]).append(" ");
        }
    }
    return sb.toString();
}
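The same filtering idea can be tried without OpenNLP on the classpath. The following is a sketch under stated assumptions, not the book's StopWords class: the class name StopWordFilter and its constructor are hypothetical, String.split on whitespace stands in for WhitespaceTokenizer, and a StringJoiner is used so that, unlike the method above, the result carries no trailing space.

```java
import java.util.Set;
import java.util.StringJoiner;

final class StopWordFilter {
    private final Set<String> stopWords;

    // Hypothetical constructor: the caller supplies the stop word set.
    StopWordFilter(Set<String> stopWords) {
        this.stopWords = stopWords;
    }

    // Returns the input with stop words dropped and tokens joined by one space.
    String removeStopWords(String words) {
        StringJoiner kept = new StringJoiner(" ");
        // split("\\s+") is a stand-in for OpenNLP's WhitespaceTokenizer.
        for (String token : words.trim().split("\\s+")) {
            // An empty input yields one empty token, which is skipped here.
            if (!token.isEmpty() && !stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept.toString();
    }
}
```

A call such as new StopWordFilter(Set.of("the", "a")).removeStopWords("the cat chased a mouse") keeps only the three content words. The input is assumed to be lowercase already, as in the preceding step.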

We created a StopWords instance using the stop-words_english_2_en.txt file as shown in the following code sequence. This is one of several lists that can be downloaded from https://code.google.com/p/stop-words/. We chose this file simply because it contains stop words that we felt were appropriate for the topic.
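Independently of the StopWords class, whose constructor is not shown in this section, a list file of this kind can be read into a set with the JDK alone. The sketch below is only an illustration: the class name StopWordLoader is hypothetical, and it assumes the file holds one stop word per line in UTF-8, which should be checked against the downloaded file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class StopWordLoader {
    // Reads one stop word per line (assumed format) into a lowercase set.
    static Set<String> load(Path file) throws IOException {
        Set<String> words = new HashSet<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String word = line.trim().toLowerCase(Locale.ROOT);
            // Skip blank lines so the empty string never counts as a stop word.
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }
}
```

Storing the words in a HashSet keeps each contains check close to constant time, which matters when filtering thousands of sentences.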
