Finding Parts of Text - Natural Language Processing with Java


Using the WhitespaceTokenizer class

As its name implies, this class uses whitespace as the delimiter. In the following code sequence, the tokenize method is called on the class's shared INSTANCE field with paragraph as input. The for statement then displays the tokens:

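import opennlp.tools.tokenize.WhitespaceTokenizer;

// Split the paragraph on whitespace and print one token per line
String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(paragraph);
for (String token : tokens) {
    System.out.println(token);
}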

The output is as follows:

Let's

pause,

and

then

reflect.

Although this does not separate contractions and similar units of text, it can be useful for some applications. The class also provides a tokenizePos method that returns the boundaries of the tokens as an array of Span objects.
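
To make the idea of token boundaries concrete, the following is a minimal plain-Java sketch that does not use OpenNLP. The WhitespaceSpans class and its tokenizePos helper are hypothetical names chosen for illustration, and the sample sentence is an assumption based on the output shown above. The helper only mimics the idea of returning a start offset and an exclusive end offset for each whitespace-delimited token, using int pairs where OpenNLP uses Span objects.

```java
import java.util.ArrayList;
import java.util.List;

class WhitespaceSpans {

    // Hypothetical helper: returns {start, end} offsets (end exclusive)
    // for each run of non-whitespace characters in the text.
    static List<int[]> tokenizePos(String text) {
        List<int[]> spans = new ArrayList<>();
        int start = -1; // -1 means we are not currently inside a token
        for (int i = 0; i < text.length(); i++) {
            boolean ws = Character.isWhitespace(text.charAt(i));
            if (!ws && start < 0) {
                start = i; // a token begins here
            } else if (ws && start >= 0) {
                spans.add(new int[] {start, i}); // the token ends before i
                start = -1;
            }
        }
        if (start >= 0) {
            spans.add(new int[] {start, text.length()}); // token at end of text
        }
        return spans;
    }

    public static void main(String[] args) {
        // Assumed sample text; the original paragraph variable is not shown here
        String paragraph = "Let's pause, and then reflect.";
        for (int[] span : tokenizePos(paragraph)) {
            System.out.println("[" + span[0] + ".." + span[1] + ") "
                + paragraph.substring(span[0], span[1]));
        }
    }
}
```

Keeping offsets instead of copied strings lets a caller map each token back to its position in the original text, for example to highlight it.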

Using the TokenizerME class

The TokenizerME class uses a statistical model, trained using Maximum Entropy (maxent), to perform tokenization. The maxent model is used to determine the relationship between data, in our case, text. Some text sources, such as social media posts, are not well formatted and use a lot of slang and special symbols such as emoticons. A statistical tokenizer, such as one based on a maxent model, can improve the quality of the tokenization process for such text.
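
As a rough illustration of the idea, and not of OpenNLP's actual implementation, a maxent model can be thought of as assigning a weight to each feature observed at a candidate split point and turning the summed weights into a probability for each outcome. The sketch below is plain Java. The MaxentSplitSketch class, the feature names, and the weights are all invented for illustration; a real model learns its features and weights from annotated training text.

```java
import java.util.List;
import java.util.Map;

class MaxentSplitSketch {

    // Hypothetical, hand-picked weights for the "split" outcome.
    // The "no split" outcome is fixed at weight 0 for every feature,
    // which is a common way to write a two-outcome maxent model.
    static final Map<String, Double> SPLIT_WEIGHTS = Map.of(
        "next=punctuation", 2.0,
        "prev=letter", 0.5,
        "next=letter", -3.0);

    // p(split | features) = exp(s) / (exp(s) + exp(0)), where s is the
    // sum of the split weights of the features that are present.
    static double splitProbability(List<String> features) {
        double score = 0.0;
        for (String feature : features) {
            score += SPLIT_WEIGHTS.getOrDefault(feature, 0.0);
        }
        double expSplit = Math.exp(score);
        return expSplit / (expSplit + 1.0);
    }

    public static void main(String[] args) {
        // Candidate boundary between a letter and a following comma
        double beforeComma = splitProbability(List.of("prev=letter", "next=punctuation"));
        // Candidate boundary between two letters inside a word
        double insideWord = splitProbability(List.of("prev=letter", "next=letter"));
        System.out.printf("split before comma: %.3f%n", beforeComma);
        System.out.printf("split inside word:  %.3f%n", insideWord);
    }
}
```

With these made-up weights, a boundary before punctuation scores well above 0.5 and a boundary inside a word scores well below it, which is the kind of decision a statistical tokenizer makes at each candidate position.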

Note

A detailed discussion of this model is not possible here due to its complexity. A good

starting point for an interested reader can be found at http://en.wikipedia.org/w/in-
