Finding Sentences - Natural Language Processing with Java


• Whether parentheses should be balanced

The default constructor neither forces the final token to be a stop nor expects parentheses to be balanced. The sentence model needs to be used with a tokenizer. We will use the INSTANCE singleton of the IndoEuropeanTokenizerFactory class for this purpose, as shown here:

TokenizerFactory TOKENIZER_FACTORY =
    IndoEuropeanTokenizerFactory.INSTANCE;
SentenceModel sentenceModel = new IndoEuropeanSentenceModel();

A tokenizer is created and its tokenize method is invoked to populate two lists:

List<String> tokenList = new ArrayList<>();
List<String> whiteList = new ArrayList<>();
Tokenizer tokenizer = TOKENIZER_FACTORY.tokenizer(
    paragraph.toCharArray(), 0, paragraph.length());
tokenizer.tokenize(tokenList, whiteList);
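To make the relationship between the two lists concrete, here is a minimal sketch that uses only the standard library. The list contents are hand-written stand-ins for what a tokenizer might produce for the text "Hi there. Bye."; they are not actual LingPipe output. The sketch assumes that the whitespace list holds one more entry than the token list, with each whitespace entry sitting before, between, or after the tokens. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class TokenWhitespaceLayout {

    // Rebuilds the text by interleaving whitespace and token entries.
    // Assumes whites has exactly one more entry than tokens.
    static String rebuild(List<String> tokens, List<String> whites) {
        StringBuilder sb = new StringBuilder(whites.get(0));
        for (int i = 0; i < tokens.size(); i++) {
            sb.append(tokens.get(i)).append(whites.get(i + 1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical tokens and whitespaces for "Hi there. Bye."
        List<String> tokenList =
            new ArrayList<>(List.of("Hi", "there", ".", "Bye", "."));
        List<String> whiteList =
            new ArrayList<>(List.of("", " ", "", " ", "", ""));
        System.out.println(rebuild(tokenList, whiteList));
    }
}
```

Under that assumption, interleaving the two lists reproduces the original text, which is why both lists are kept rather than the tokens alone.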

The boundaryIndices method returns an array of integer boundary indexes. The method requires two String array arguments containing the tokens and whitespaces. The tokenize method populated two List objects with these elements, so we need to convert the lists into equivalent arrays, as shown here:

String[] tokens = new String[tokenList.size()];
String[] whites = new String[whiteList.size()];
tokenList.toArray(tokens);
whiteList.toArray(whites);
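The boundary indexes mentioned above are positions in the token array, not character offsets. The following sketch, which uses only the standard library, illustrates one way to turn such indexes back into sentences. It assumes that each index marks the last token of a sentence and that the whitespace array is one element longer than the token array. The arrays and the boundary values are hand-written for illustration and are not produced by LingPipe; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class BoundarySketch {

    // Groups tokens into sentences, treating each boundary as the
    // index of a sentence's final token (an assumption of this sketch).
    static List<String> sentences(String[] tokens, String[] whites,
                                  int[] boundaries) {
        List<String> result = new ArrayList<>();
        int start = 0;
        for (int end : boundaries) {
            StringBuilder sb = new StringBuilder();
            for (int i = start; i <= end; i++) {
                sb.append(tokens[i]);
                // whites[i + 1] is the whitespace that follows tokens[i];
                // it is skipped after the sentence's final token.
                if (i < end) {
                    sb.append(whites[i + 1]);
                }
            }
            result.add(sb.toString());
            start = end + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical data for the text "Hi there. Bye."
        String[] tokens = {"Hi", "there", ".", "Bye", "."};
        String[] whites = {"", " ", "", " ", "", ""};
        int[] boundaries = {2, 4}; // hand-written, not model output
        for (String sentence : sentences(tokens, whites, boundaries)) {
            System.out.println(sentence);
        }
    }
}
```

With these stand-in values, the tokens at positions 0 through 2 form the first sentence and those at positions 3 through 4 form the second.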

We can then use the boundaryIndices method and display the indexes:

int[] sentenceBoundaries =
    sentenceModel.boundaryIndices(tokens, whites);
for (int boundary : sentenceBoundaries) {
    System.out.println(boundary);
}

The output is shown here:
