paragraph = "The colour of money is green. Common fraction "
    + "characters such as ½ are converted to the long form 1/2. "
    + "Quotes such as “cat” are converted to their simpler form.";
// The option string enables the three normalizations shown in the output
ptb = new PTBTokenizer(
    new StringReader(paragraph), new CoreLabelTokenFactory(),
    "americanize=true,normalizeFractions=true,asciiQuotes=true");
wtsp = new WordToSentenceProcessor();
// Group the token stream into sentences
sents = wtsp.process(ptb.tokenize());
for (List<CoreLabel> sent : sents) {
    for (CoreLabel element : sent) {
        System.out.print(element + " ");
    }
    System.out.println();
}
The output is as follows:
The color of money is green .
Common fraction characters such as 1/2 are converted to the long form 1/2 .
Quotes such as " cat " are converted to their simpler form .
The British spelling of the word "colour" was converted to its American equivalent. The
fraction ½ was expanded to three characters: 1/2. In the last sentence, the smart quotes
were converted to their simpler form.
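The three substitutions described above can be imitated with plain string replacement. The following is only a minimal sketch using the JDK, not the way PTBTokenizer implements its options: the class name SimpleNormalizer and its small replacement table are hypothetical and cover just the cases in this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SimpleNormalizer {
    // Hypothetical, hand-picked substitutions; not the tables used by PTBTokenizer.
    private static final Map<String, String> REPLACEMENTS = new LinkedHashMap<>();
    static {
        REPLACEMENTS.put("colour", "color"); // British to American spelling
        REPLACEMENTS.put("\u00BD", "1/2");   // one-half fraction character
        REPLACEMENTS.put("\u201C", "\"");    // left double smart quote
        REPLACEMENTS.put("\u201D", "\"");    // right double smart quote
    }

    // Applies each literal replacement in insertion order.
    public static String normalize(String text) {
        String result = text;
        for (Map.Entry<String, String> entry : REPLACEMENTS.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(normalize("The colour of money is green."));
    }
}
```

Unlike the tokenizer options, this sketch works on the raw string before tokenization and knows nothing about word boundaries, so it would also rewrite "colour" inside a longer word.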
Using the DocumentPreprocessor class
When an instance of the DocumentPreprocessor class is created, it uses its Reader
parameter to produce a list of sentences. It also implements the Iterable interface,
which makes it easy to traverse the list.
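The Reader-plus-Iterable design described above can be illustrated without the Stanford library. The following is a minimal stand-in sketch, not DocumentPreprocessor itself: the class name ReaderSentences and its toList method are hypothetical, sentence boundaries come from the JDK's java.text.BreakIterator, and each sentence is a plain String rather than a list of tokens.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

public class ReaderSentences implements Iterable<String> {
    private final List<String> sentences = new ArrayList<>();

    // Reads all text from the Reader and splits it into sentences up front.
    public ReaderSentences(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[1024];
        try {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String text = sb.toString();
        BreakIterator boundaries = BreakIterator.getSentenceInstance(Locale.US);
        boundaries.setText(text);
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
    }

    // Implementing Iterable allows the object to be used in a for-each loop.
    @Override
    public Iterator<String> iterator() {
        return sentences.iterator();
    }

    public List<String> toList() {
        return new ArrayList<>(sentences);
    }

    public static void main(String[] args) {
        String paragraph = "The colour of money is green. Quotes are simple.";
        for (String sentence : new ReaderSentences(new StringReader(paragraph))) {
            System.out.println(sentence);
        }
    }
}
```

Because the class implements Iterable, callers can traverse the sentences with a for-each loop and never touch the underlying list directly.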
In the following example, the paragraph is used to create a StringReader object, and
this object is used to instantiate the DocumentPreprocessor instance: