Java Reference
In-Depth Information
for (List<CoreLabel> sent : sents) {
    int size = sent.size();
    System.out.println(sent.get(size - 1) + " "
        + sent.get(size - 1).endPosition());
}
This will produce the following output:
. 74
! 116
? 145
... 317
A number of options are available when the constructor of the PTBTokenizer
class is invoked. They are passed as the constructor's third parameter, a single
string in which the options are separated by commas, as shown here:
"americanize=true,normalizeFractions=true,asciiQuotes=true".
Several of these options are listed in this table:
Option                     Meaning
invertible                 Used to indicate that the tokens and whitespace must be preserved so that the original string can be reconstructed
tokenizeNLs                Indicates that the ends of lines must be treated as tokens
americanize                If true, this will rewrite British spellings as American spellings
normalizeAmpersandEntity   Will convert the XML &amp; entity to an ampersand
normalizeFractions         Converts common fraction characters such as ½ to the long form (1/2)
asciiQuotes                Will convert quote characters to the simpler ' and " characters
unicodeQuotes              Will convert quote characters to characters that range from U+2018 to U+201D
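Because the options travel as one comma-separated string, it can be convenient to assemble that string from name/value pairs rather than typing it by hand. The following is only a minimal sketch using the standard library: the PtbOptionString class and its build method are hypothetical helpers and are not part of the Stanford library. The closing comment assumes the three-argument PTBTokenizer constructor (reader, token factory, option string) described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper (not part of Stanford CoreNLP): builds the
// comma-separated option string expected as the third constructor argument.
class PtbOptionString {

    // Joins the entries as "name=value,name=value", keeping insertion order.
    static String build(Map<String, Boolean> options) {
        StringJoiner joiner = new StringJoiner(",");
        for (Map.Entry<String, Boolean> entry : options.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap preserves the order in which options are added.
        Map<String, Boolean> options = new LinkedHashMap<>();
        options.put("americanize", true);
        options.put("normalizeFractions", true);
        options.put("asciiQuotes", true);

        String optionString = build(options);
        System.out.println(optionString);

        // Assumed usage with the library on the classpath:
        // new PTBTokenizer<>(reader, new CoreLabelTokenFactory(), optionString);
    }
}
```

Building the string from a map keeps each option name next to its value and avoids misplaced commas, at the cost of a few extra lines compared with a string literal.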
The following sequence illustrates the use of this option string: