Finding Parts of Text - Natural Language Processing with Java


The output of this sequence is as follows. The numbers within the parentheses indicate the

tokens' beginning and ending positions:

Let (0-3)

's (3-5)

pause (6-11)

, (11-12)

and (14-17)

then (18-22)

reflect (23-30)

. (30-31)
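To make the begin and end positions concrete, here is a minimal sketch that produces the same kind of listing using only the standard library. It is not the Stanford tokenizer: the class name TokenOffsets, the method tokenizeWithOffsets, and the regular expression are all assumptions made for illustration, and the pattern is far cruder than a Penn Treebank style tokenizer. The end offset is exclusive, matching the convention in the output above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenOffsets {
    // Hypothetical pattern (an assumption, not the Stanford rules):
    // a clitic such as 's, a run of word characters, or one punctuation mark.
    private static final Pattern TOKEN = Pattern.compile("'\\w+|\\w+|[^\\w\\s]");

    // Returns each token formatted as "text (begin-end)", with an exclusive end offset.
    public static List<String> tokenizeWithOffsets(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            // start() and end() give the character offsets of the current match.
            result.add(matcher.group() + " (" + matcher.start() + "-" + matcher.end() + ")");
        }
        return result;
    }

    public static void main(String[] args) {
        for (String token : tokenizeWithOffsets("Let's pause, and then reflect.")) {
            System.out.println(token);
        }
    }
}
```

The offsets come straight from Matcher.start() and Matcher.end(), so they always index into the original string, including any whitespace the pattern skips over.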

Using the DocumentPreprocessor class

The DocumentPreprocessor class tokenizes input from an input stream. In addition, it implements the Iterable interface, making it easy to traverse the tokenized sequence. It supports both plain text and XML data.

To illustrate this process, we will use an instance of the StringReader class that wraps the paragraph string, as defined here:

Reader reader = new StringReader(paragraph);

A DocumentPreprocessor is then instantiated with this reader:

DocumentPreprocessor documentPreprocessor =
    new DocumentPreprocessor(reader);
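The general shape of this design, a class that consumes a Reader and exposes its sentences through Iterable, can be sketched without the Stanford library. The following is a stand-in under stated assumptions, not the library's implementation: SentenceIterable is a hypothetical class, sentences are split naively after '.', '!' or '?', and tokens are plain strings split on whitespace rather than HasWord objects.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in: each element of the iteration is one sentence,
// represented as a list of whitespace-separated tokens.
public class SentenceIterable implements Iterable<List<String>> {
    private final List<List<String>> sentences = new ArrayList<>();

    public SentenceIterable(Reader reader) {
        String text = readAll(reader).trim();
        // Naive sentence split (an assumption): break on whitespace that
        // follows a sentence-ending punctuation mark.
        for (String sentence : text.split("(?<=[.!?])\\s+")) {
            if (!sentence.isEmpty()) {
                sentences.add(Arrays.asList(sentence.split("\\s+")));
            }
        }
    }

    // Reads the whole stream into memory; fine for a sketch, not for large input.
    private static String readAll(Reader reader) {
        StringBuilder builder = new StringBuilder();
        char[] buffer = new char[1024];
        try {
            int count;
            while ((count = reader.read(buffer)) != -1) {
                builder.append(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return builder.toString();
    }

    @Override
    public Iterator<List<String>> iterator() {
        return sentences.iterator();
    }

    public static void main(String[] args) {
        Reader reader = new StringReader("Let's pause, and then reflect. Finally, we act.");
        // Because the class is Iterable, it can be used directly in a for-each loop.
        for (List<String> sentence : new SentenceIterable(reader)) {
            System.out.println(sentence);
        }
    }
}
```

Implementing Iterable is what lets callers choose between an explicit Iterator and a for-each loop; the class itself only has to supply iterator().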

The DocumentPreprocessor class implements the Iterable<java.util.List<HasWord>> interface. The HasWord interface contains two methods that deal with words: setWord and word. The latter returns the word as a string. In the next code sequence, the DocumentPreprocessor class splits the input text into sentences, each of which is stored as a List<HasWord>. An Iterator object is used to extract each sentence, and a for-each statement then displays its tokens:

Iterator<List<HasWord>> it =
    documentPreprocessor.iterator();

while (it.hasNext()) {
