Using the BreakIterator class
Another approach to tokenization uses the BreakIterator class from the java.text package. This class locates boundaries, expressed as integer indexes, for different units of text such as characters, words, lines, and sentences. In this section, we will illustrate how it can be used to find words.
The class has a single default constructor, which is protected. We will use the static getWordInstance method to get an instance of the class. This method is overloaded, with one version using a Locale object. The class possesses several methods to access boundaries, as listed in the following table. It has one field, DONE, that is used to indicate that the last boundary has been found.
Method      Usage
first       Returns the first boundary of the text
next        Returns the next boundary following the current one
previous    Returns the boundary preceding the current one
setText     Associates a string with the BreakIterator instance
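The methods in the table can be combined in a short sketch that collects every word boundary in a string. The class name WordBoundaries and the helper method boundaries are hypothetical names chosen for this illustration, and the choice of Locale.US is an assumption; only the BreakIterator calls come from the standard library.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordBoundaries {

    // Hypothetical helper: returns every boundary index that the word
    // iterator reports for the given text, in ascending order.
    static List<Integer> boundaries(String text, Locale locale) {
        // Use the overloaded factory method that takes a Locale
        BreakIterator iterator = BreakIterator.getWordInstance(locale);
        iterator.setText(text);

        List<Integer> result = new ArrayList<>();
        // first() positions the iterator at the start of the text;
        // next() advances until it returns BreakIterator.DONE
        for (int b = iterator.first(); b != BreakIterator.DONE; b = iterator.next()) {
            result.add(b);
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Let's pause, and then reflect.";
        // Print the list of boundary indexes found in the text
        System.out.println(boundaries(text, Locale.US));
    }
}
```

The first boundary is always index 0 and the last is the length of the text. Whitespace and punctuation also fall between boundaries, so not every pair of adjacent boundaries delimits a word.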
To demonstrate this class, we declare an instance of the BreakIterator class and a string to use with it:
BreakIterator wordIterator = BreakIterator.getWordInstance();
String text = "Let's pause, and then reflect.";
The text is then assigned to the instance and the first boundary is determined:
wordIterator.setText(text);
int boundary = wordIterator.first();
The loop that follows stores the beginning and ending boundary indexes of each word break in the begin and end variables. The boundary values are integers. Each boundary pair and its associated text are displayed.
When the last boundary is found, the loop terminates: