Finding Parts of Text - Natural Language Processing with Java - page 53

Using the BreakIterator class

Another approach to tokenization uses the BreakIterator class. This class locates boundaries, expressed as integer offsets, for different units of text (characters, words, lines, and sentences). In this section, we illustrate how it can be used to find words.

The class has a single default constructor, which is protected, so we use the static getWordInstance method to obtain an instance. This method is overloaded, with one version accepting a Locale object. The class provides several methods for accessing boundaries, as listed in the following table. It has one field, DONE, which is returned to indicate that the last boundary has been reached.

Method      Usage
first       Returns the first boundary of the text
next        Returns the next boundary following the current one
previous    Returns the boundary preceding the current one
setText     Associates a string with the BreakIterator instance
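
As a rough illustration of the methods in the table, the following sketch steps the iterator forward twice and then back once. The class name BoundaryNavigationDemo, the navigate helper, and the sample string are assumptions made for this illustration only. It uses the getWordInstance overload that takes a Locale.

```java
import java.text.BreakIterator;
import java.util.Arrays;
import java.util.Locale;

public class BoundaryNavigationDemo {

    // Hypothetical helper: returns the boundaries visited by
    // first(), next(), next(), and previous(), in that order.
    static int[] navigate(String text) {
        // Overload of getWordInstance that accepts a Locale
        BreakIterator iterator = BreakIterator.getWordInstance(Locale.US);
        iterator.setText(text);

        int first = iterator.first();    // boundary at the start of the text
        int second = iterator.next();    // boundary after the first word
        int third = iterator.next();     // boundary after the following space
        int back = iterator.previous();  // steps back to the preceding boundary
        return new int[] {first, second, third, back};
    }

    public static void main(String[] args) {
        // prints [0, 5, 6, 5]
        System.out.println(Arrays.toString(navigate("Hello world")));
    }
}
```

Note that the space between the two words produces its own pair of boundaries, so a word iterator reports whitespace spans as well as words.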

To demonstrate this class, we declare an instance of the BreakIterator class and a string to use with it:

// Requires: import java.text.BreakIterator;
BreakIterator wordIterator = BreakIterator.getWordInstance();
String text = "Let's pause, and then reflect.";

The text is then assigned to the instance and the first boundary is determined:

wordIterator.setText(text);
int boundary = wordIterator.first();

The loop that follows stores the beginning and ending indexes of each word break in the begin and end variables. The boundary values are integers. Each boundary pair and its associated text are displayed.

When the last boundary is found, the loop terminates:
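
The loop itself is not reproduced in this excerpt. A minimal sketch of such a loop, which is not necessarily the authors' exact code, might look like the following. The class name WordBoundaryDemo, the wordSpans helper, and the "begin-end [text]" display format are assumptions introduced for this illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

public class WordBoundaryDemo {

    // Hypothetical helper: builds one "begin-end [text]" entry
    // for each pair of adjacent boundaries.
    static List<String> wordSpans(String text) {
        BreakIterator wordIterator = BreakIterator.getWordInstance();
        wordIterator.setText(text);

        List<String> spans = new ArrayList<>();
        int begin = wordIterator.first();
        int end = wordIterator.next();

        // next() returns BreakIterator.DONE once the last boundary is passed
        while (end != BreakIterator.DONE) {
            spans.add(begin + "-" + end + " [" + text.substring(begin, end) + "]");
            begin = end;
            end = wordIterator.next();
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "Let's pause, and then reflect.";
        for (String span : wordSpans(text)) {
            System.out.println(span);
        }
    }
}
```

Because the iterator reports every boundary, spaces and punctuation marks appear as their own entries alongside the words. A caller that wants only words would need to filter those entries out.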
