Comparing this to the original text, we see that it does a pretty good job.
Lemmatization is similar to stemming. It is the process of finding
a word's lemma, that is, the form of the word as found in a dictionary.
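To make the distinction concrete, the following minimal sketch contrasts a crude suffix-stripping stemmer with a dictionary lookup. It is an illustration only: the class name, the method names, and the four-entry word list are invented for this example and are not part of OpenNLP or WordNet.

```java
import java.util.Map;

// Toy illustration only: the word list is invented for this example
// and is not the WordNet dictionary or any OpenNLP API.
class LemmaLookupSketch {

    // Hypothetical miniature dictionary mapping inflected forms to lemmas.
    private static final Map<String, String> LEMMAS = Map.of(
            "mice", "mouse",
            "ran", "run",
            "better", "good",
            "dreams", "dream");

    // Crude stemmer stand-in: strips a trailing "s" from words longer than three letters.
    static String crudeStem(String word) {
        if (word.length() > 3 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Dictionary-based lookup: returns the lemma if known, else the lowercased word.
    static String lemma(String word) {
        String key = word.toLowerCase();
        return LEMMAS.getOrDefault(key, key);
    }

    public static void main(String[] args) {
        for (String w : new String[] {"mice", "ran", "dreams"}) {
            System.out.println(w + " -> stem: " + crudeStem(w) + ", lemma: " + lemma(w));
        }
    }
}
```

A rule-based stemmer cannot relate an irregular form such as "mice" to "mouse", whereas a dictionary lookup can, provided the word is in the dictionary.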
Using lemmatization in OpenNLP
OpenNLP also supports lemmatization using the JWNLDictionary class. This class's
constructor takes a string containing the path to the dictionary files used to identify
roots. We will use a WordNet dictionary developed at Princeton University
(wordnet.princeton.edu). The actual dictionary is a series of files stored in a directory. These
files contain a list of words and their "roots". The dictionary used for the example in this
section is available at the following URL:
https://code.google.com/p/xssm/downloads/detail?name=SimilarityUtils.zip&can=2&q=
The JWNLDictionary class's getLemmas method is passed the word we want to process
and a second parameter that specifies the part of speech (POS) for the word. It is
important that the POS match the actual word type if we want accurate results.
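The effect of the POS parameter can be sketched with a small, self-contained lookup keyed by both word and part of speech. Everything here is an assumption made for illustration: the Pos enum, the lemma method, and the four dictionary entries are hypothetical and do not reflect how JWNLDictionary stores or retrieves its data.

```java
import java.util.Map;

// Toy illustration only: not the JWNLDictionary API. The entries and the
// Pos enum are invented to show why the POS changes the lemma returned.
class PosLemmaSketch {

    // Hypothetical part-of-speech labels for this sketch.
    enum Pos { NOUN, VERB }

    // Keys have the form "word/POS", so the same word can map to different lemmas.
    private static final Map<String, String> LEMMAS = Map.of(
            "saw/VERB", "see",
            "saw/NOUN", "saw",
            "leaves/VERB", "leave",
            "leaves/NOUN", "leaf");

    // Returns the lemma for the word under the given POS, or the word itself if unknown.
    static String lemma(String word, Pos pos) {
        return LEMMAS.getOrDefault(word.toLowerCase() + "/" + pos, word);
    }

    public static void main(String[] args) {
        System.out.println("saw as verb:    " + lemma("saw", Pos.VERB));
        System.out.println("saw as noun:    " + lemma("saw", Pos.NOUN));
        System.out.println("leaves as verb: " + lemma("leaves", Pos.VERB));
        System.out.println("leaves as noun: " + lemma("leaves", Pos.NOUN));
    }
}
```

The same surface form yields a different lemma depending on the POS supplied, which is why a mismatched POS can produce an inaccurate result.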
In the next code sequence, we create an instance of the JWNLDictionary class using a
path ending with \\dict\\. This is the location of the dictionary. We also define our
sample text. The constructor can throw IOException and JWNLException, which
we handle in a try-catch block:
try {
    dictionary = new JWNLDictionary("…\\dict\\");
    paragraph = "Eat, drink, and be merry, for life is but a dream";
} catch (IOException | JWNLException ex) {
    // Handle exceptions
}
After initializing the text, add the following statements. First, we tokenize the
string using the WhitespaceTokenizer class as explained in the section Using the
WhitespaceTokenizer class. Then, each token is passed to the getLemmas method with
an empty string as the POS type. The original token and its lemmas are then displayed:
String tokens[] =
WhitespaceTokenizer.INSTANCE.tokenize(paragraph);
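Because the tokenizer splits on whitespace only, punctuation stays attached to the neighboring word. The following sketch approximates that behavior with the standard String.split method so that it runs without OpenNLP on the classpath. It is a stand-in, not the WhitespaceTokenizer implementation, and the class and method names are invented for this example.

```java
// Stand-in for whitespace tokenization using only the standard library.
// This is not OpenNLP's WhitespaceTokenizer; it only mimics splitting on whitespace.
class WhitespaceSplitSketch {

    // Splits the text on runs of whitespace after trimming leading and trailing spaces.
    static String[] tokenize(String text) {
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String paragraph = "Eat, drink, and be merry, for life is but a dream";
        String[] tokens = tokenize(paragraph);
        for (String token : tokens) {
            // Tokens such as "Eat," keep their trailing comma.
            System.out.println("[" + token + "]");
        }
    }
}
```

Tokens such as "Eat," and "merry," retain their commas, which is worth keeping in mind when each token is later used as a dictionary lookup key.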