Java Reference
• Convert the sentences to lowercase
• Remove stop words
• Create an internal index data structure
We will develop two classes to support the index data structure: Word and Positions.
We will also augment the StopWords class, developed in Chapter 2, Finding Parts of
Text, to support an overloaded version of the removeStopWords method. The new
version will provide a more convenient way to remove stop words.
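The three steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the Word and Positions classes described in the text: the class name PositionalIndexSketch, the tiny stop word list, the regex tokenizer, and the choice to store sentence indexes as "positions" are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: not the Word and Positions classes from the text.
class PositionalIndexSketch {

    // Tiny illustrative stop word list (an assumption, not the StopWords class).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "of", "and", "to", "in"));

    // Maps each remaining word to the indexes of the sentences that contain it.
    static Map<String, List<Integer>> buildIndex(String[] sentences) {
        Map<String, List<Integer>> index = new LinkedHashMap<>();
        for (int i = 0; i < sentences.length; i++) {
            // Step 1: convert the sentence to lowercase.
            String lower = sentences[i].toLowerCase(Locale.ROOT);
            // Split on runs of non-letter characters (a crude tokenizer).
            for (String token : lower.split("[^\\p{L}]+")) {
                // Step 2: skip empty tokens and stop words.
                if (token.isEmpty() || STOP_WORDS.contains(token)) {
                    continue;
                }
                // Step 3: record the sentence index once per word per sentence.
                List<Integer> positions =
                        index.computeIfAbsent(token, k -> new ArrayList<>());
                if (positions.isEmpty() || positions.get(positions.size() - 1) != i) {
                    positions.add(i);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        String[] sentences = {
            "The sea is everything.",
            "The Nautilus dived into the sea."
        };
        System.out.println(buildIndex(sentences));
    }
}
```

A lookup such as buildIndex(sentences).get("sea") then returns the sentences to search. Dedicated classes like Word and Positions would presumably carry more detail than a plain list of integers.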
We start with a try-with-resources block to open streams for the sentence model,
en-sent.bin, and a file containing the contents of Twenty Thousand Leagues Under the
Sea by Jules Verne. The book was downloaded from Project Gutenberg (ebook 164) and
modified slightly to remove the leading and trailing Gutenberg text:
try (InputStream is = new FileInputStream(new File(
        "C:/Current Books/NLP and Java/Models/en-sent.bin"));
        FileReader fr = new FileReader("Twenty Thousands.txt");
        BufferedReader br = new BufferedReader(fr)) {
    // Sentence detection and indexing code goes here
} catch (IOException ex) {
    // Handle exceptions
}
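When several resources are declared in one try-with-resources header, as above, they are closed automatically in the reverse order of their declaration once the block exits. The following sketch makes that order visible. The Tracked class and the resource names are hypothetical stand-ins for the real streams.

```java
import java.util.ArrayList;
import java.util.List;

class ResourceOrderSketch {

    // Hypothetical resource that logs when it is opened and closed.
    static class Tracked implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Tracked(String name, List<String> log) {
            this.name = name;
            this.log = log;
            log.add("open " + name);
        }

        @Override
        public void close() {
            log.add("close " + name);
        }
    }

    // Opens two resources, runs the body, and returns the recorded events.
    static List<String> run() {
        List<String> log = new ArrayList<>();
        try (Tracked model = new Tracked("model", log);
             Tracked reader = new Tracked("reader", log)) {
            log.add("body");
        }
        return log;
    }

    public static void main(String[] args) {
        // Prints [open model, open reader, body, close reader, close model]
        System.out.println(run());
    }
}
```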
The sentence model is used to create an instance of the SentenceDetectorME class
as shown here:
SentenceModel model = new SentenceModel(is);
SentenceDetectorME detector = new SentenceDetectorME(model);
Next, we will build a single string with a StringBuilder instance to support the
detection of sentence boundaries. The book's file is read line by line and appended
to the StringBuilder instance. The sentDetect method is then applied to create an
array of sentences, as shown here:
String line;
StringBuilder sb = new StringBuilder();
while ((line = br.readLine()) != null) {
    sb.append(line).append(' ');
}
String[] sentences = detector.sentDetect(sb.toString());
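The space appended after each line matters: readLine returns the line without its terminator, so without the space the last word of one line would fuse with the first word of the next and confuse sentence detection. Here is a small self-contained sketch of the same loop. The class JoinLinesSketch and the use of a StringReader in place of the book file are assumptions made for the demonstration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class JoinLinesSketch {

    // Reads every line from the reader and joins the lines with single spaces.
    static String joinLines(BufferedReader br) {
        StringBuilder sb = new StringBuilder();
        String line;
        try {
            while ((line = br.readLine()) != null) {
                // readLine() drops the line terminator, so a space is added
                // to keep words on adjacent lines apart.
                sb.append(line).append(' ');
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the book file (an assumption for the demo).
        String text = "The year 1866 was signalised\nby a remarkable incident.";
        try (BufferedReader br = new BufferedReader(new StringReader(text))) {
            System.out.println(joinLines(br));
        }
    }
}
```

The result ends with a trailing space, which is harmless as input to a sentence detector.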