The output of this sequence is as follows. The numbers within the parentheses indicate the
tokens' beginning and ending positions:
Let (0-3)
's (3-5)
pause (6-11)
, (11-12)
and (14-17)
then (18-22)
reflect (23-30)
. (30-31)
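The begin and end positions in a listing like this can be reproduced with nothing more than the standard regex classes. The following is a minimal sketch, not the tokenizer used above: the class name OffsetSketch, the token pattern, and the input string (with a space and a line break after the comma, chosen so that the offsets line up with the listing) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any NLP library.
final class OffsetSketch {
    // Assumed token pattern: a clitic such as 's, a run of word
    // characters, or a single punctuation mark.
    private static final Pattern TOKEN =
        Pattern.compile("'\\w+|\\w+|[^\\w\\s]");

    // Returns each token followed by its (begin-end) character offsets.
    static List<String> tokenizeWithOffsets(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // start() is inclusive, end() is exclusive.
            out.add(m.group() + " (" + m.start() + "-" + m.end() + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed input: note the space and newline after the comma.
        String paragraph = "Let's pause, \nand then reflect.";
        for (String line : tokenizeWithOffsets(paragraph)) {
            System.out.println(line);
        }
    }
}
```

The end offset is exclusive, which is why the token 's is reported as 3-5 even though it occupies positions 3 and 4.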
Using the DocumentPreprocessor class
The DocumentPreprocessor class tokenizes input from an input stream. It also
implements the Iterable interface, which makes it easy to traverse the tokenized sequence.
The tokenizer supports both plain text and XML data.
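The "iterable of sentences, each a list of tokens" shape can be sketched with the standard library alone. The class below is a hypothetical stand-in, not the DocumentPreprocessor class: the name SentenceIterableSketch, the token pattern, and the rule that a period, exclamation mark, or question mark always ends a sentence are simplifying assumptions.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in: it only mimics the Iterable-of-sentences shape.
final class SentenceIterableSketch implements Iterable<List<String>> {
    // Assumed token pattern: a run of word characters or one punctuation mark.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    private final List<List<String>> sentences = new ArrayList<>();

    SentenceIterableSketch(String text) {
        List<String> current = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String token = m.group();
            current.add(token);
            // Assumption: these three marks always close a sentence.
            if (token.equals(".") || token.equals("!") || token.equals("?")) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current); // trailing text without a terminator
        }
    }

    @Override
    public Iterator<List<String>> iterator() {
        return sentences.iterator();
    }

    public static void main(String[] args) {
        // Because the class is Iterable, a for-each loop visits each sentence.
        for (List<String> sentence :
                new SentenceIterableSketch("We pause. Then we reflect.")) {
            System.out.println(sentence);
        }
    }
}
```

Implementing Iterable is what lets callers choose between an explicit Iterator and a for-each loop; a real sentence splitter has to handle abbreviations, quotations, and markup that this sketch ignores.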
To illustrate this process, we will create an instance of the StringReader class that
wraps the paragraph string:
Reader reader = new StringReader(paragraph);
A DocumentPreprocessor instance is then created from the reader:
DocumentPreprocessor documentPreprocessor =
    new DocumentPreprocessor(reader);
The DocumentPreprocessor class implements the Iterable<java.util.List<HasWord>>
interface. The HasWord interface contains two methods that deal with words: setWord
and word. The latter returns a word as a string. In the next code sequence, the
DocumentPreprocessor class splits the input text into sentences, each of which is stored
as a List<HasWord>. An Iterator object is used to extract each sentence, and a for-each
statement then displays its tokens:
Iterator<List<HasWord>> it =
documentPreprocessor.iterator();
while (it.hasNext()) {