Combined Approaches - Natural Language Processing with Java

• Convert the sentences to lowercase

• Remove stop words

• Create an internal index data structure

We will develop two classes to support the index data structure: Word and Positions. We will also augment the StopWords class, developed in Chapter 2, Finding Parts of Text, with an overloaded version of the removeStopWords method. The new version provides a more convenient way to remove stop words.
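The three steps listed above can be sketched with standard collections alone. This is only an illustration under stated assumptions: the class name IndexSketch, the tiny stop-word set, and the use of a Map from each word to a list of sentence numbers are all hypothetical, and they stand in for the Word, Positions, and StopWords classes, which are not reproduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class IndexSketch {
    // Hypothetical stop-word list, used only for this illustration.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "and", "was", "in"));

    // Lowercases a sentence, splits it on non-letter characters,
    // and drops any token found in the stop-word set.
    static List<String> normalize(String sentence) {
        List<String> words = new ArrayList<>();
        for (String token : sentence.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                words.add(token);
            }
        }
        return words;
    }

    // Maps each remaining word to the numbers of the sentences that contain it.
    static Map<String, List<Integer>> buildIndex(String[] sentences) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int i = 0; i < sentences.length; i++) {
            for (String word : normalize(sentences[i])) {
                List<Integer> positions =
                        index.computeIfAbsent(word, k -> new ArrayList<>());
                // Record each sentence number once, even if the word repeats.
                if (positions.isEmpty() || positions.get(positions.size() - 1) != i) {
                    positions.add(i);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        String[] sentences = {
            "The Nautilus was a submarine.",
            "Captain Nemo commanded the Nautilus."
        };
        // prints {captain=[1], commanded=[1], nautilus=[0, 1], nemo=[1], submarine=[0]}
        System.out.println(buildIndex(sentences));
    }
}
```

A TreeMap is used here only so that the printed index is sorted alphabetically; a HashMap would serve equally well for lookups.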

We start with a try-with-resources block to open streams for the sentence model, en-sent.bin, and a file containing the contents of Twenty Thousand Leagues Under the Sea by Jules Verne. The book was downloaded from http://www.gutenberg.org/ebooks/164 and modified slightly to remove the leading and trailing Gutenberg text, which makes it more readable:

try (InputStream is = new FileInputStream(new File(
        "C:/Current Books/NLP and Java/Models/en-sent.bin"));
        FileReader fr = new FileReader("Twenty Thousands.txt");
        BufferedReader br = new BufferedReader(fr)) {
    …
} catch (IOException ex) {
    // Handle exceptions
}

The sentence model is used to create an instance of the SentenceDetectorME class, as shown here:

SentenceModel model = new SentenceModel(is);
SentenceDetectorME detector = new SentenceDetectorME(model);
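Running the detector requires the OpenNLP library and the en-sent.bin model file. To experiment with sentence splitting without them, the JDK's java.text.BreakIterator can serve as a rough stand-in. Note that this is a different, rule-based technique and not the trained model used in this example, so its boundaries may differ from those that SentenceDetectorME finds. The class name SentenceSplitSketch and the sample text are assumptions made for this sketch.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitSketch {
    // Splits text into sentences using the JDK's rule-based BreakIterator.
    // This is a stand-in for experimentation, not the OpenNLP detector.
    static List<String> splitSentences(String text) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
        iterator.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE;
                start = end, end = iterator.next()) {
            // Trim the whitespace that follows each sentence boundary.
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "The sea is everything. It covers seven tenths of the globe.";
        for (String sentence : splitSentences(text)) {
            System.out.println(sentence);
        }
    }
}
```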

Next, we will create a string using a StringBuilder instance to support the detection of sentence boundaries. The book's file is read line by line and each line is added to the StringBuilder instance. The sentDetect method is then applied to create an array of sentences, as shown here:

String line;
StringBuilder sb = new StringBuilder();
while ((line = br.readLine()) != null) {
    sb.append(line + " ");
