Various statistics are displayed, followed by the tokens marked up with position
information. The output is as follows:
Sentence #1 (8 tokens):
Let's pause,
and then reflect.
[Text=Let CharacterOffsetBegin=0 CharacterOffsetEnd=3]
[Text='s CharacterOffsetBegin=3 CharacterOffsetEnd=5]
[Text=pause CharacterOffsetBegin=6 CharacterOffsetEnd=11]
[Text=, CharacterOffsetBegin=11 CharacterOffsetEnd=12]
[Text=and CharacterOffsetBegin=14 CharacterOffsetEnd=17]
[Text=then CharacterOffsetBegin=18 CharacterOffsetEnd=22]
[Text=reflect CharacterOffsetBegin=23 CharacterOffsetEnd=30]
[Text=. CharacterOffsetBegin=30 CharacterOffsetEnd=31]
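To make the offset values concrete, here is a minimal plain-Java sketch that prints the same kind of begin and end offsets using java.util.regex. It is an illustration only, not the tokenizer that produced the output above: the class name OffsetTokenizerSketch, the regular expression, and the exact whitespace in the sample paragraph (a space and a newline after the comma, inferred from the offsets) are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OffsetTokenizerSketch {
    // Hypothetical pattern: a clitic such as 's, a run of word characters,
    // or a single punctuation mark. These are not the real tokenizer's rules.
    private static final Pattern TOKEN = Pattern.compile("'\\w+|\\w+|[^\\w\\s]");

    static List<String> tokenize(String text) {
        List<String> lines = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // start() is inclusive and end() is exclusive,
            // so end - start is the length of the token.
            lines.add(String.format(
                "[Text=%s CharacterOffsetBegin=%d CharacterOffsetEnd=%d]",
                m.group(), m.start(), m.end()));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Assumed input: whitespace reconstructed from the offsets shown above.
        String paragraph = "Let's pause, \nand then reflect.";
        for (String line : tokenize(paragraph)) {
            System.out.println(line);
        }
    }
}
```

Note that the end offset is exclusive: the token Let spans offsets 0 to 3, and the following token 's begins at 3. The jump from 12 to 14 between the comma and the word and accounts for the two whitespace characters that are skipped.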
Using LingPipe tokenizers
LingPipe supports a number of tokenizers. In this section, we will illustrate the use of the
IndoEuropeanTokenizerFactory class; later sections demonstrate other ways that LingPipe
supports tokenization. The class's INSTANCE field provides an instance of an Indo-European
tokenizer factory. The factory's tokenizer method takes a character array, a start offset,
and a length, and returns a Tokenizer instance for the text to be processed, as shown here:
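import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;

// The tokenizer works on a character array rather than a String
char[] text = paragraph.toCharArray();
TokenizerFactory tokenizerFactory =
    IndoEuropeanTokenizerFactory.INSTANCE;
// Tokenize the whole array: start at offset 0 and cover text.length characters
Tokenizer tokenizer = tokenizerFactory.tokenizer(text, 0,
    text.length);
// Tokenizer is iterable over its tokens
for (String token : tokenizer) {
    System.out.println(token);
}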
The output is as follows:
Let
'
s
pause