Using the BreakIterator class
Another approach to tokenization involves the use of the BreakIterator class. This
class locates boundaries, expressed as integer character offsets, between units of text
such as characters, words, sentences, and lines. In this section, we will illustrate how
it can be used to find words.
The class has a single default constructor, which is protected. We will use the static
getWordInstance method to get an instance of the class. An overloaded version of this
method takes a Locale object. The class possesses several methods to access boundaries,
as listed in the following table. It has one field, DONE, which the next and previous
methods return when there are no more boundaries in that direction.
Method     Usage
first      Returns the first boundary of the text
next       Returns the next boundary following the current one
previous   Returns the boundary preceding the current one
setText    Associates a string with the BreakIterator instance
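As a quick illustration of how these methods move through the boundaries, here is a
minimal sketch. The class name BoundaryMethodsDemo and the helper walk are hypothetical
names chosen for this example. Locale.US is passed explicitly as an assumption so that
the result does not depend on the default locale. The sketch also calls last, a
BreakIterator method that is not listed in the table.

```java
import java.text.BreakIterator;
import java.util.Arrays;
import java.util.Locale;

public class BoundaryMethodsDemo {

    // Hypothetical helper: records what each navigation call returns.
    // Intended for a short text such as "Hi there", whose word
    // boundaries fall at offsets 0, 2, 3, and 8.
    static int[] walk(String text) {
        BreakIterator iterator = BreakIterator.getWordInstance(Locale.US);
        iterator.setText(text);

        int first = iterator.first();        // first boundary, always 0
        int second = iterator.next();        // boundary after the first word
        int third = iterator.next();         // boundary after the space
        int back = iterator.previous();      // steps back one boundary
        int last = iterator.last();          // last boundary, text.length()
        int pastEnd = iterator.next();       // nothing left: BreakIterator.DONE

        return new int[] {first, second, third, back, last, pastEnd};
    }

    public static void main(String[] args) {
        // Prints [0, 2, 3, 2, 8, -1]; DONE has the value -1.
        System.out.println(Arrays.toString(walk("Hi there")));
    }
}
```

Note that whitespace counts as a unit of its own, so a boundary is reported on both
sides of the space.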
To demonstrate this class, we declare an instance of the BreakIterator class and a
string to use with it:
BreakIterator wordIterator = BreakIterator.getWordInstance();
String text = "Let's pause, and then reflect.";
The text is then assigned to the instance and the first boundary is determined:
wordIterator.setText(text);
int boundary = wordIterator.first();
The loop that follows stores the beginning and ending boundary indexes of each word
break in the begin and end variables. The boundary values are integers. Each boundary
pair and its associated text are displayed.
When the last boundary is found, the loop terminates:
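A minimal sketch of such a loop is shown next. It is written as a self-contained
program, so the iterator and string are declared again. The class name WordBoundaryDemo
and the helper wordBoundaries are hypothetical names chosen for illustration, and
Locale.US is passed explicitly as an assumption to keep the result independent of the
default locale.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBoundaryDemo {

    // Hypothetical helper: returns one "begin-end [text]" entry per
    // pair of adjacent word boundaries found in the text.
    static List<String> wordBoundaries(String text) {
        BreakIterator wordIterator = BreakIterator.getWordInstance(Locale.US);
        wordIterator.setText(text);

        List<String> result = new ArrayList<>();
        int begin = wordIterator.first();
        int end = wordIterator.next();
        // next() returns DONE once the last boundary has been passed.
        while (end != BreakIterator.DONE) {
            result.add(begin + "-" + end + " [" + text.substring(begin, end) + "]");
            begin = end;
            end = wordIterator.next();
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Let's pause, and then reflect.";
        for (String entry : wordBoundaries(text)) {
            System.out.println(entry);
        }
    }
}
```

For the sample string, this sketch prints the following. Spaces and punctuation marks
appear as separate entries because the iterator reports every boundary, not only those
around words:

0-5 [Let's]
5-6 [ ]
6-11 [pause]
11-12 [,]
12-13 [ ]
13-16 [and]
16-17 [ ]
17-21 [then]
21-22 [ ]
22-29 [reflect]
29-30 [.]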