• Whether parentheses should be balanced
The default constructor does not force the final token to be a stop, nor does it expect
parentheses to be balanced. The sentence model needs to be used with a tokenizer. We will
use the singleton INSTANCE field of the IndoEuropeanTokenizerFactory class for this
purpose, as shown here:
TokenizerFactory TOKENIZER_FACTORY =
    IndoEuropeanTokenizerFactory.INSTANCE;
SentenceModel sentenceModel =
    new IndoEuropeanSentenceModel();
A tokenizer is created and its tokenize method is invoked to populate two lists:
List<String> tokenList = new ArrayList<>();
List<String> whiteList = new ArrayList<>();
Tokenizer tokenizer = TOKENIZER_FACTORY.tokenizer(
    paragraph.toCharArray(), 0, paragraph.length());
tokenizer.tokenize(tokenList, whiteList);
The boundaryIndices method returns an array of integer boundary indexes. The
method requires two String array arguments containing the tokens and the whitespace
between them. The tokenize method populated two List instances with these elements,
so we need to convert the lists into equivalent arrays, as shown here:
String[] tokens = new String[tokenList.size()];
String[] whites = new String[whiteList.size()];
tokenList.toArray(tokens);
whiteList.toArray(whites);
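The conversion can also be written more compactly. The following is a minimal sketch using only the standard library; the class name ListToArrayDemo, the helper toStringArray, and the sample values are hypothetical and not part of the original example. Passing a zero-length array to toArray lets the list allocate a correctly sized result itself:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ListToArrayDemo {

    // Converts a List<String> to a String[]; the zero-length argument
    // only tells toArray which array type to create.
    static String[] toStringArray(List<String> list) {
        return list.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the token list filled by a tokenizer
        List<String> tokenList = new ArrayList<>();
        tokenList.add("It");
        tokenList.add("works");
        tokenList.add(".");

        String[] tokens = toStringArray(tokenList);
        System.out.println(Arrays.toString(tokens));
    }
}
```

Both forms produce the same array; the pre-sized version in the text simply makes the allocation explicit.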
We can then use the boundaryIndices method and display the indexes:
int[] sentenceBoundaries =
    sentenceModel.boundaryIndices(tokens, whites);
for (int boundary : sentenceBoundaries) {
    System.out.println(boundary);
}
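To see how boundary indexes relate to the text, the following sketch rebuilds sentences from token and whitespace arrays. It does not call the library: the arrays and the boundary values are hand-written, hypothetical data, and the class name BoundaryDemo and the helper toSentences are introduced only for illustration. It assumes that each index identifies the last token of a sentence, and that the whitespace array has one more element than the token array, with whites[i + 1] following tokens[i]:

```java
import java.util.ArrayList;
import java.util.List;

public class BoundaryDemo {

    // Rebuilds sentence strings from tokens, whitespace, and the indexes
    // of sentence-final tokens. Assumes whites.length == tokens.length + 1
    // and that whites[i + 1] is the whitespace following tokens[i].
    static List<String> toSentences(String[] tokens, String[] whites,
                                    int[] boundaries) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int boundary : boundaries) {
            StringBuilder sb = new StringBuilder();
            for (int i = start; i <= boundary; i++) {
                sb.append(tokens[i]);
                // Skip the whitespace after the sentence-final token
                if (i < boundary) {
                    sb.append(whites[i + 1]);
                }
            }
            sentences.add(sb.toString());
            start = boundary + 1;
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Hypothetical data standing in for tokenizer and model results
        String[] tokens = {"It", "works", ".", "Try", "it", "."};
        String[] whites = {"", " ", "", " ", " ", "", ""};
        int[] boundaries = {2, 5};

        for (String sentence : toSentences(tokens, whites, boundaries)) {
            System.out.println(sentence);
        }
    }
}
```

With this sample data the sketch prints "It works." and "Try it." on separate lines. Any tokens after the last boundary are ignored, since they do not belong to a completed sentence.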
The output is shown here: