• Convert the sentences to lowercase
• Remove stop words
• Create an internal index data structure
We will develop two classes to support the index data structure: Word and Positions. We will also add an overloaded version of the removeStopWords method, which provides a more convenient way to remove stop words.
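The three preparation steps listed above can be sketched together in a few lines. This is only an illustration under stated assumptions: the class name SimpleIndexSketch, the tiny stop-word set, and the map from each word to the sentence numbers containing it are hypothetical stand-ins, not the Word and Positions classes developed in this section.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; not the Word/Positions classes from this chapter.
final class SimpleIndexSketch {

    // Assumed, deliberately tiny stop-word list for the example.
    private static final Set<String> STOP_WORDS =
            Set.of("the", "a", "of", "and", "was");

    // Maps each remaining word to the indexes of the sentences that contain it.
    static Map<String, List<Integer>> buildIndex(String[] sentences) {
        Map<String, List<Integer>> index = new LinkedHashMap<>();
        for (int i = 0; i < sentences.length; i++) {
            // Step 1: convert the sentence to lowercase.
            String lower = sentences[i].toLowerCase(Locale.ROOT);
            // Split on runs of characters that are not lowercase ASCII letters.
            for (String word : lower.split("[^a-z]+")) {
                // Step 2: skip empty tokens and stop words.
                if (word.isEmpty() || STOP_WORDS.contains(word)) {
                    continue;
                }
                // Step 3: record the sentence index once per sentence.
                List<Integer> positions =
                        index.computeIfAbsent(word, k -> new ArrayList<>());
                if (positions.isEmpty()
                        || positions.get(positions.size() - 1) != i) {
                    positions.add(i);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        String[] sentences = {
            "The sea was calm.",
            "The captain of the ship watched the sea."
        };
        System.out.println(buildIndex(sentences));
    }
}
```

A map keyed by word keeps lookups simple. A dedicated class per word, as this section goes on to develop, leaves room for richer position information.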
We start with a try-with-resources block to open streams for the sentence model, en-sent.bin, and a file containing the contents of Twenty Thousand Leagues Under the Sea by Jules Verne. The book was downloaded from http://www.gutenberg.org/ebooks/164 and modified slightly to remove the leading and trailing Gutenberg text, which makes it more readable:
try (InputStream is = new FileInputStream(new File(
        "C:/Current Books/NLP and Java/Models/en-sent.bin"));
     FileReader fr = new FileReader("Twenty Thousands.txt");
     BufferedReader br = new BufferedReader(fr)) {
    …
} catch (IOException ex) {
    // Handle exceptions
}
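A FileReader created from a file name decodes the text with the JVM's default charset. As a hedged variation on the same try-with-resources pattern, the sketch below opens the text with an explicit UTF-8 charset through Files.newBufferedReader. The class name ReadTextSketch, the method joinLines, and the temporary sample file are hypothetical and stand in for the book's text file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; names are chosen for illustration only.
final class ReadTextSketch {

    // Reads every line from the source and joins the lines with single spaces.
    static String joinLines(Reader source) {
        StringBuilder sb = new StringBuilder();
        // The reader is closed automatically when the try block exits,
        // whether it completes normally or with an exception.
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append(' ');
            }
        } catch (IOException ex) {
            // Rethrow unchecked so callers are not forced to handle it here.
            throw new UncheckedIOException(ex);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for the downloaded book.
        Path tmp = Files.createTempFile("sample", ".txt");
        Files.write(tmp, "Line one.\nLine two.\n".getBytes(StandardCharsets.UTF_8));
        // Open the file with an explicit charset instead of the default.
        System.out.println(joinLines(
                Files.newBufferedReader(tmp, StandardCharsets.UTF_8)));
        Files.delete(tmp);
    }
}
```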
The sentence model is used to create an instance of the SentenceDetectorME class, as shown here:
SentenceModel model = new SentenceModel(is);
SentenceDetectorME detector = new SentenceDetectorME(model);
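Running SentenceDetectorME requires the OpenNLP library and the en-sent.bin model file. For experimenting with the rest of the pipeline when neither is at hand, sentence boundaries can be roughly approximated with the JDK's rule-based java.text.BreakIterator. This is a different and simpler technique than the trained model, so its results will not match OpenNLP's in general. The class and method names below are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for experimentation; not the OpenNLP detector.
final class BreakIteratorSketch {

    // Splits text into trimmed sentences using the JDK's rule-based iterator.
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        int start = it.first();
        // Walk consecutive boundaries; each pair delimits one sentence.
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : splitSentences("The sea was calm. The ship sailed on.")) {
            System.out.println(s);
        }
    }
}
```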
Next, we read the book's file line by line into a StringBuilder instance, producing a single string for sentence-boundary detection. The sentDetect method is then applied to that string to create an array of sentences, as shown here:
String line;
StringBuilder sb = new StringBuilder();
while ((line = br.readLine()) != null) {
    sb.append(line + " ");