• Whether parentheses should be balanced
The default constructor does not force the final token to be a stop, nor does it expect parentheses to be balanced. The sentence model needs to be used with a tokenizer. We will use the shared INSTANCE of the IndoEuropeanTokenizerFactory class for this purpose, together with a default-constructed IndoEuropeanSentenceModel, as shown here:
TokenizerFactory TOKENIZER_FACTORY =
    IndoEuropeanTokenizerFactory.INSTANCE;
SentenceModel sentenceModel = new IndoEuropeanSentenceModel();
A tokenizer is created and its tokenize method is invoked to populate two lists:
List<String> tokenList = new ArrayList<>();
List<String> whiteList = new ArrayList<>();
Tokenizer tokenizer = TOKENIZER_FACTORY.tokenizer(
    paragraph.toCharArray(), 0, paragraph.length());
tokenizer.tokenize(tokenList, whiteList);
The boundaryIndices method returns an array of integer boundary indexes. The method requires two String array arguments containing tokens and whitespaces. The tokenize method, however, populated two List instances with these elements. This means we need to convert each List into an equivalent array, as shown here:
String[] tokens = new String[tokenList.size()];
String[] whites = new String[whiteList.size()];
tokenList.toArray(tokens);
whiteList.toArray(whites);
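The conversion step can also be written with the standard toArray(new String[0]) idiom, which sizes the result array automatically. The following is a minimal sketch using only the Java standard library. The class name ListToArraySketch, the helper toStringArray, and the hand-filled lists are hypothetical stand-ins for the tokenizer output. It also assumes the whitespace list holds one more entry than the token list, which is how LingPipe tokenizers are generally understood to fill it.

```java
import java.util.ArrayList;
import java.util.List;

public class ListToArraySketch {

    // Copies a list of strings into a new, correctly sized String array.
    static String[] toStringArray(List<String> items) {
        return items.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Hand-filled lists standing in for what tokenize() would populate.
        List<String> tokenList = new ArrayList<>();
        List<String> whiteList = new ArrayList<>();
        whiteList.add("");          // whitespace before the first token
        tokenList.add("Hello");
        whiteList.add("");          // whitespace between "Hello" and "."
        tokenList.add(".");
        whiteList.add("");          // whitespace after the last token

        String[] tokens = toStringArray(tokenList);
        String[] whites = toStringArray(whiteList);
        System.out.println(tokens.length + " tokens, "
            + whites.length + " whitespace entries");
    }
}
```

Either form produces the same arrays. The helper only avoids declaring the array size by hand.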
We can then use the boundaryIndices method and display the indexes:
int[] sentenceBoundaries =
    sentenceModel.boundaryIndices(tokens, whites);
for (int boundary : sentenceBoundaries) {
    System.out.println(boundary);
}
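Each value in the array is best read as a position in the tokens array, not a character offset. To illustrate how such indexes could be turned back into sentence text, here is a small sketch that uses only the Java standard library and hand-built data in place of real tokenizer output. The class BoundarySketch and the method sentences are hypothetical names. The sketch assumes that each boundary is the index of a sentence-final token and that whites[i + 1] holds the whitespace following tokens[i].

```java
import java.util.ArrayList;
import java.util.List;

public class BoundarySketch {

    // Rebuilds sentence strings from parallel token and whitespace arrays.
    // Assumption: boundaries[k] is the index of the last token of sentence k,
    // and whites[i + 1] is the whitespace that follows tokens[i].
    static List<String> sentences(String[] tokens, String[] whites,
                                  int[] boundaries) {
        List<String> result = new ArrayList<>();
        int start = 0;
        for (int boundary : boundaries) {
            StringBuilder sb = new StringBuilder();
            for (int i = start; i <= boundary; i++) {
                sb.append(tokens[i]);
                if (i < boundary) {
                    // Keep the original spacing inside the sentence only.
                    sb.append(whites[i + 1]);
                }
            }
            result.add(sb.toString());
            start = boundary + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        // Hand-built data standing in for tokenizer and sentence model output.
        String[] tokens = {"It", "works", ".", "Really", "."};
        String[] whites = {"", " ", "", " ", "", ""};
        int[] boundaries = {2, 4};

        for (String sentence : sentences(tokens, whites, boundaries)) {
            System.out.println(sentence);
        }
    }
}
```

The loop in the LingPipe example above does none of this reconstruction. It simply prints each boundary index on its own line.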
The output is shown here: