Finding Parts of Text - Natural Language Processing with Java


Comparing this to the original text, we see that it does a pretty good job.

Lemmatization is similar to stemming. It is the process of finding a word's lemma, that is, the form of the word as found in a dictionary.

Using lemmatization in OpenNLP

OpenNLP also supports lemmatization using the JWNLDictionary class. This class's constructor takes a string that contains the path of the dictionary files used to identify roots. We will use a WordNet dictionary developed at Princeton University (wordnet.princeton.edu). The actual dictionary is a series of files stored in a directory. These

files contain a list of words and their "root". For the example used in this section, we will

use the dictionary found at https://code.google.com/p/xssm/downloads/de-

The JWNLDictionary class's getLemmas method is passed the word we want to process and a second parameter that specifies the part of speech (POS) for the word. It is important that the POS match the actual word type if we want accurate results.
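To illustrate why the POS argument matters, here is a minimal sketch of a lookup keyed on both the word and its POS tag. It does not use OpenNLP or WordNet: the class LemmaLookupSketch, the method lookupLemmas, the three toy entries, and the fallback of returning the word unchanged are all assumptions made for this illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a dictionary-backed lemmatizer; NOT the OpenNLP API.
final class LemmaLookupSketch {

    // Toy dictionary (assumed data): key is "word|POS", value is the list of lemmas.
    private static final Map<String, List<String>> ENTRIES = new HashMap<>();

    static {
        ENTRIES.put("saw|NN", List.of("saw"));   // noun: the cutting tool
        ENTRIES.put("saw|VBD", List.of("see"));  // verb: past tense of "see"
        ENTRIES.put("is|VBZ", List.of("be"));
    }

    // Returns the lemmas stored for the word under the given POS tag.
    // When no entry matches, the word itself is returned unchanged (an assumed fallback).
    static List<String> lookupLemmas(String word, String pos) {
        String key = word.toLowerCase(Locale.ROOT) + "|" + pos;
        return ENTRIES.getOrDefault(key, List.of(word));
    }

    public static void main(String[] args) {
        System.out.println(lookupLemmas("saw", "NN"));   // noun reading
        System.out.println(lookupLemmas("saw", "VBD"));  // verb reading
        System.out.println(lookupLemmas("saw", ""));     // no POS supplied
    }
}
```

In this toy version, the same surface form resolves to different lemmas depending on the tag, and a missing or mismatched tag simply finds no entry. A real dictionary may behave differently when the POS is empty.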

In the next code sequence, we create an instance of the JWNLDictionary class using a path ending with \\dict\\. This is the location of the dictionary. We also define our sample text. The constructor can throw IOException and JWNLException, which we deal with in a try-catch block:

try {
    dictionary = new JWNLDictionary("…\\dict\\");
    paragraph = "Eat, drink, and be merry, for life is but a dream";
    …
} catch (IOException | JWNLException ex) {
    // Handle the exception
}

Following the text initialization, add the following statements. First, we tokenize the string using the WhitespaceTokenizer class, as explained in the section Using the WhitespaceTokenizer class. Then, each token is passed to the getLemmas method with an empty string as the POS type. The original token and its lemmas are then displayed:

String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(paragraph);
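Because whitespace tokenization splits only on whitespace, punctuation stays attached to the tokens that are later looked up. The following sketch shows this on the sample text using plain Java instead of OpenNLP. The class WhitespaceSplitSketch and the helper stripPunctuation are hypothetical names introduced for this illustration, and stripping punctuation before the lookup is an assumption, not a step taken from the text.

```java
// Plain-Java approximation of whitespace tokenization; NOT the OpenNLP tokenizer.
final class WhitespaceSplitSketch {

    // Splits on runs of whitespace only, so punctuation remains part of each token.
    static String[] tokenize(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Hypothetical helper: removes leading and trailing non-letter characters.
    static String stripPunctuation(String token) {
        return token.replaceAll("^\\P{L}+|\\P{L}+$", "");
    }

    public static void main(String[] args) {
        String paragraph = "Eat, drink, and be merry, for life is but a dream";
        for (String token : tokenize(paragraph)) {
            // Show the raw token next to a cleaned form that a dictionary could match.
            System.out.println(token + " -> " + stripPunctuation(token));
        }
    }
}
```

A token such as "Eat," keeps its trailing comma, so a dictionary lookup on the raw token may not find the word.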
