Java Reference
In-Depth Information
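// paragraph is the String holding the text to split into sentences
Reader reader = new StringReader(paragraph);
DocumentPreprocessor dp = new DocumentPreprocessor(reader);
// Each iteration yields one sentence as a list of tokens
for (List<HasWord> sentence : dp) {
    System.out.println(sentence);
}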
On execution, we get the following output:
[When, determining, the, end, of, sentences, we, need, to,
consider, several, factors, .]
[Sentences, may, end, with, exclamation, marks, !]
[Or, possibly, questions, marks, ?]
[Within, sentences, we, may, find, numbers, like, 3.14159,
,, abbreviations, such, as, found, in, Mr., Smith, ,, and,
possibly, ellipses, either, within, a, sentence, ..., ,,
or, at, the, end, of, a, sentence, ...]
By default, PTBTokenizer is used to tokenize the input. The setTokenizerFactory
method can be used to specify a different tokenizer. There are several other methods
that can be useful, as detailed in the following table:
Method                     Purpose
setElementDelimiter        Its argument specifies an XML element. Only the text inside those elements will be processed.
setSentenceDelimiter       The processor will assume that the string argument is a sentence delimiter.
setSentenceFinalPuncWords  Its string array argument specifies the end-of-sentence delimiters.
setKeepEmptySentences      When used with whitespace models, if its argument is true, then empty sentences will be retained.
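To make the idea of sentence-final punctuation words more concrete, the following is a minimal sketch that uses only the JDK. It groups tokens that have already been split into sentences, closing a sentence whenever a token matches one of the supplied delimiters. The SentenceGrouper class and its group method are hypothetical names introduced for illustration; they are not part of the Stanford NLP API, and this is not how DocumentPreprocessor is implemented internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of the Stanford NLP API.
public class SentenceGrouper {

    // Groups tokens into sentences, closing the current sentence after
    // any token that appears in finalPuncWords.
    public static List<List<String>> group(List<String> tokens, String[] finalPuncWords) {
        Set<String> finals = new HashSet<>(Arrays.asList(finalPuncWords));
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            current.add(token);
            if (finals.contains(token)) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        // Trailing tokens without a final delimiter still form a sentence.
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
            "Sentences", "may", "end", "with", "exclamation", "marks", "!",
            "Or", "possibly", "questions", "marks", "?");
        // Print one sentence per line, in the same list form as the output above.
        for (List<String> sentence : group(tokens, new String[] {".", "!", "?"})) {
            System.out.println(sentence);
        }
    }
}
```

Running main prints two lists, one ending in ! and one ending in ?. Passing a different array changes where sentences are closed, which is the kind of control that a string array of sentence-final words provides.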
The class can process either plain text or XML documents.
To demonstrate how an XML file can be processed, we will create a simple XML file
called XMLText.xml, containing the following data:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl"?>
<document>