Finding Parts of Text - Natural Language Processing with Java

Various statistics are displayed, followed by the tokens marked up with position information. The output is as follows:

Sentence #1 (8 tokens):

Let's pause,

and then reflect.

[Text=Let CharacterOffsetBegin=0 CharacterOffsetEnd=3]

[Text='s CharacterOffsetBegin=3 CharacterOffsetEnd=5]

[Text=pause CharacterOffsetBegin=6 CharacterOffsetEnd=11]

[Text=, CharacterOffsetBegin=11 CharacterOffsetEnd=12]

[Text=and CharacterOffsetBegin=14 CharacterOffsetEnd=17]

[Text=then CharacterOffsetBegin=18 CharacterOffsetEnd=22]

[Text=reflect CharacterOffsetBegin=23 CharacterOffsetEnd=30]

[Text=. CharacterOffsetBegin=30 CharacterOffsetEnd=31]
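
In the listing above, each end offset is exclusive: the token Let ends at 3, which is exactly where 's begins. The following is a minimal sketch of how such begin and end offsets can be derived, using only java.util.regex rather than the tokenizer that produced the output above. The class name OffsetSketch, the describe method, and the token pattern are assumptions made for illustration; the pattern is a rough approximation and is not the rule set of any particular library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of any NLP library.
class OffsetSketch {
    // Assumed token rule: an apostrophe clitic such as 's, a run of letters
    // or digits, or a single punctuation character.
    private static final Pattern TOKEN =
            Pattern.compile("'\\p{L}+|[\\p{L}\\p{N}]+|[^\\s\\p{L}\\p{N}]");

    // Returns one formatted line per token, with begin (inclusive) and
    // end (exclusive) character offsets into the original text.
    static List<String> describe(String text) {
        List<String> lines = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            lines.add("[Text=" + m.group()
                    + " CharacterOffsetBegin=" + m.start()
                    + " CharacterOffsetEnd=" + m.end() + "]");
        }
        return lines;
    }

    public static void main(String[] args) {
        // Same sample text as above, including the line break after the comma.
        describe("Let's pause, \nand then reflect.").forEach(System.out::println);
    }
}
```

Matcher.start() and Matcher.end() already follow the inclusive-begin, exclusive-end convention, so the offsets need no adjustment before printing.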

Using LingPipe tokenizers

LingPipe supports a number of tokenizers. In this section, we will illustrate the use of the IndoEuropeanTokenizerFactory class; later sections demonstrate other ways that LingPipe supports tokenization. The class's INSTANCE field provides an instance of an Indo-European tokenizer factory. The factory's tokenizer method returns an instance of the Tokenizer class based on the text to be processed, as shown here:

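import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;

// paragraph is the String holding the text to be tokenized
char[] text = paragraph.toCharArray();
TokenizerFactory tokenizerFactory = IndoEuropeanTokenizerFactory.INSTANCE;
// Tokenize the whole array: start at offset 0 and cover text.length characters
Tokenizer tokenizer = tokenizerFactory.tokenizer(text, 0, text.length);
// Tokenizer is Iterable<String>, so it can be used directly in a for-each loop
for (String token : tokenizer) {
    System.out.println(token);
}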

The output is as follows:

Let

'

s

pause
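
The output above shows the apostrophe emitted as a token of its own. As a rough illustration of that behavior, and of why a tokenizer built from a character array, a start offset, and a length can be used in a for-each loop, here is a minimal sketch that uses only the Java standard library. SliceTokenizer is a hypothetical stand-in, not a LingPipe class, and its single regular expression is an assumed simplification of the rules a real Indo-European tokenizer applies.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in, not a LingPipe class: tokenizes a slice of a char
// array and exposes the tokens through Iterable for use in a for-each loop.
class SliceTokenizer implements Iterable<String> {
    // Assumed rule: a run of letters or digits, or any single
    // non-whitespace character (so an apostrophe stands alone).
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+|\\S");

    private final List<String> tokens = new ArrayList<>();

    // start is the first character to process; length is how many to cover.
    SliceTokenizer(char[] text, int start, int length) {
        Matcher m = TOKEN.matcher(new String(text, start, length));
        while (m.find()) {
            tokens.add(m.group());
        }
    }

    @Override
    public Iterator<String> iterator() {
        return tokens.iterator();
    }

    public static void main(String[] args) {
        char[] text = "Let's pause, \nand then reflect.".toCharArray();
        for (String token : new SliceTokenizer(text, 0, text.length)) {
            System.out.println(token);
        }
    }
}
```

For the sample text, the first four tokens this sketch prints are Let, ', s, and pause. Whitespace is skipped between matches rather than returned as tokens.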
