Java Reference
In-Depth Information
• Convert the sentences to lowercase
• Remove stop words
• Create an internal index data structure
We will develop two classes to support the index data structure: Word and Positions.
We will also augment the StopWords class, developed in Chapter 2, Finding Parts of
Text, to support an overloaded version of the removeStopWords method. The new
version will provide a more convenient method for removing stop words.
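The three steps listed above can be sketched with standard library classes only. This is a minimal illustration under stated assumptions, not the Word and Positions design the text develops: the IndexSketch class, its STOP_WORDS list, and the buildIndex method are hypothetical names, and the index is simply a map from each word to the numbers of the sentences that contain it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class IndexSketch {
    // Hypothetical stop word list; the StopWords class from the text is not used here.
    static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "an", "and", "the", "of", "to", "is", "in"));

    // Maps each remaining word to the numbers of the sentences in which it appears.
    static Map<String, List<Integer>> buildIndex(String[] sentences) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int i = 0; i < sentences.length; i++) {
            // Step 1: convert the sentence to lowercase.
            String lower = sentences[i].toLowerCase(Locale.ROOT);
            // Split on runs of non-letters (a simplification, not a real tokenizer).
            for (String word : lower.split("[^a-z]+")) {
                // Step 2: skip empty tokens and stop words.
                if (word.isEmpty() || STOP_WORDS.contains(word)) {
                    continue;
                }
                // Step 3: record the sentence number once per word.
                List<Integer> positions =
                        index.computeIfAbsent(word, k -> new ArrayList<>());
                if (positions.isEmpty()
                        || positions.get(positions.size() - 1) != i) {
                    positions.add(i);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        String[] sentences = {
            "The sea is everything.",
            "The Nautilus sank in the sea."
        };
        // Prints {everything=[0], nautilus=[1], sank=[1], sea=[0, 1]}
        System.out.println(buildIndex(sentences));
    }
}
```

A TreeMap is used here only so that the printed index is in alphabetical order; a HashMap would serve equally well for lookups.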
We start with a try-with-resources block to open streams for the sentence model,
en-sent.bin, and a file containing the contents of Twenty Thousand Leagues Under the
Sea by Jules Verne. The book was downloaded from http://www.gutenberg.org/ebooks/164
and modified slightly to remove the leading and trailing Gutenberg text, making it more
readable:
try (InputStream is = new FileInputStream(new File(
        "C:/Current Books/NLP and Java/Models/en-sent.bin"));
        FileReader fr = new FileReader("Twenty Thousands.txt");
        BufferedReader br = new BufferedReader(fr)) {
} catch (IOException ex) {
    // Handle exceptions
}
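As a side note on the try-with-resources block, every resource declared in the parentheses is closed automatically when the block exits, in the reverse order of declaration. The following sketch illustrates this with a hypothetical Tracked class that stands in for the model stream and the text reader; it is not part of the example being developed.

```java
import java.util.ArrayList;
import java.util.List;

class CloseOrderSketch {
    // Records the order in which resources are opened and closed.
    static final List<String> log = new ArrayList<>();

    // Hypothetical stand-in for a stream or reader.
    static class Tracked implements AutoCloseable {
        private final String name;

        Tracked(String name) {
            this.name = name;
            log.add("open " + name);
        }

        @Override
        public void close() {
            log.add("close " + name);
        }
    }

    static List<String> run() {
        log.clear();
        // Two resources in one statement, as in the block above.
        try (Tracked model = new Tracked("model");
                Tracked text = new Tracked("text")) {
            log.add("body");
        }
        return new ArrayList<>(log);
    }

    public static void main(String[] args) {
        // Prints [open model, open text, body, close text, close model]
        System.out.println(run());
    }
}
```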
The sentence model is used to create an instance of the SentenceDetectorME class
as shown here:
SentenceModel model = new SentenceModel(is);
SentenceDetectorME detector = new SentenceDetectorME(model);
Next, we will create a string using a StringBuilder instance to support the detection
of sentence boundaries. The book's file is read and added to the StringBuilder
instance. The sentDetect method is then applied to create an array of sentences, as
shown here:
String line;
StringBuilder sb = new StringBuilder();
while ((line = br.readLine()) != null) {
    sb.append(line + " ");