The output of this sequence is as follows. The numbers within the parentheses indicate the
tokens' beginning and ending positions:
Let (0-3)
's (3-5)
pause (6-11)
, (11-12)
and (14-17)
then (18-22)
reflect (23-30)
. (30-31)
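The begin and end positions are character offsets into the original string, with the end position exclusive. The following is a minimal sketch of how such offsets can be produced using only the standard library. It is not the Stanford tokenizer: the class name OffsetTokens, the method tokenizeWithOffsets, and the regular expression are hypothetical, and the paragraph value is an assumption (a space and a newline after the comma) chosen to be consistent with the jump from position 12 to position 14 in the output above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of any NLP library.
class OffsetTokens {

    // Returns each token followed by its (begin-end) character offsets.
    static List<String> tokenizeWithOffsets(String text) {
        // Simplified rule: a token is a clitic such as 's, a run of word
        // characters, or a single punctuation mark. Whitespace is skipped.
        Pattern pattern = Pattern.compile("'\\w+|\\w+|[^\\w\\s]");
        Matcher matcher = pattern.matcher(text);
        List<String> tokens = new ArrayList<>();
        while (matcher.find()) {
            // start() is inclusive and end() is exclusive.
            tokens.add(matcher.group() + " (" + matcher.start() + "-" + matcher.end() + ")");
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Assumed input: a space and a newline follow the comma.
        String paragraph = "Let's pause, \nand then reflect.";
        for (String token : tokenizeWithOffsets(paragraph)) {
            System.out.println(token);
        }
    }
}
```

Because whitespace is skipped rather than emitted as a token, gaps between consecutive offsets show where spaces or line breaks occurred in the input.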
Using the DocumentPreprocessor class
The DocumentPreprocessor class tokenizes input from an input stream. In addition, it implements the Iterable interface, making it easy to traverse the tokenized sequence. The tokenizer supports the tokenization of simple text and XML data.
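To see why implementing Iterable makes traversal easy, consider the following stand-in, which uses only the standard library. It is a sketch under stated assumptions, not the Stanford class: SentenceSplitter is a hypothetical name, sentences are split naively at '.', '!' or '?' followed by whitespace, and tokens are split on whitespace only, so punctuation stays attached to words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in that exposes sentences as lists of tokens.
class SentenceSplitter implements Iterable<List<String>> {

    private final List<List<String>> sentences = new ArrayList<>();

    SentenceSplitter(String text) {
        // Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
        for (String sentence : text.trim().split("(?<=[.!?])\\s+")) {
            if (!sentence.isEmpty()) {
                // Naive tokenization: split on runs of whitespace.
                sentences.add(Arrays.asList(sentence.split("\\s+")));
            }
        }
    }

    @Override
    public Iterator<List<String>> iterator() {
        return sentences.iterator();
    }

    public static void main(String[] args) {
        SentenceSplitter splitter = new SentenceSplitter("Let's pause. Then reflect.");
        // Any Iterable can be used directly in a for-each statement.
        for (List<String> sentence : splitter) {
            System.out.println(sentence);
        }
    }
}
```

Any class that implements Iterable can be traversed either with an explicit Iterator or with a for-each statement, and the same holds for a class that yields one list of tokens per sentence.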
To illustrate this process, we will use a StringReader instance that wraps the paragraph string:
Reader reader = new StringReader(paragraph);
An instance of the DocumentPreprocessor class is then created:
DocumentPreprocessor documentPreprocessor =
new DocumentPreprocessor(reader);
The DocumentPreprocessor class implements the Iterable<java.util.List<HasWord>> interface. The HasWord interface contains two methods that deal with words: setWord and word. The latter returns a word as a string. In the next code sequence, the DocumentPreprocessor class splits the input text into sentences, each of which is stored as a List<HasWord>. An Iterator object is used to extract each sentence, and a for-each statement then displays its tokens:
Iterator<List<HasWord>> it =
documentPreprocessor.iterator();
while (it.hasNext()) {