for (String token : tokens) {
    // The second argument is the POS tag; an empty string means no POS is specified
    String[] lemmas = dictionary.getLemmas(token, "");
    for (String lemma : lemmas) {
        System.out.println("Token: " + token + " Lemma: " + lemma);
    }
}
The output is as follows:
Token: Eat Lemma: at
Token: drink, Lemma: drink
Token: be Lemma: be
Token: life Lemma: life
Token: is Lemma: is
Token: is Lemma: i
Token: a Lemma: a
Token: dream Lemma: dream
The lemmatization process works well except for the token "is", which returns two lemmas. The second one is not valid. This illustrates the importance of using the proper POS for a token. We could have passed one or more POS tags as the argument to the getLemmas method. However, this raises the question: how do we determine the correct POS? This topic is discussed in detail in Chapter 5, Detecting Parts of Speech.
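To see why the POS argument matters, consider the following minimal, self-contained sketch. It is not the dictionary class used above: the class PosAwareLemmaDemo, its lookup table, and the tag-to-lemma pairings are hypothetical and only mirror the two lemmas printed for "is" in the earlier output. It assumes that an empty tag means "return every candidate" and that a specific tag narrows the result to that one reading.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class PosAwareLemmaDemo {
    // Hypothetical toy data: token -> (POS tag -> lemma).
    // The pairings are illustrative and are not taken from a real dictionary.
    private static final Map<String, Map<String, String>> CANDIDATES = new LinkedHashMap<>();

    static {
        Map<String, String> is = new LinkedHashMap<>();
        is.put("VBZ", "is"); // "is" read as a verb (mirrors the first lemma in the output above)
        is.put("NNS", "i");  // "is" misread as a plural noun with the final "s" stripped
        CANDIDATES.put("is", is);

        Map<String, String> dream = new LinkedHashMap<>();
        dream.put("NN", "dream");
        dream.put("VB", "dream");
        CANDIDATES.put("dream", dream);
    }

    // An empty tag returns every distinct candidate lemma;
    // a specific tag returns only the lemma recorded for that tag.
    public static List<String> getLemmas(String token, String posTag) {
        Map<String, String> byTag = CANDIDATES.get(token.toLowerCase(Locale.ROOT));
        if (byTag == null) {
            // Unknown token: fall back to the token itself
            return Collections.singletonList(token);
        }
        if (posTag.isEmpty()) {
            // LinkedHashSet removes duplicates while keeping insertion order
            return new ArrayList<>(new LinkedHashSet<>(byTag.values()));
        }
        String lemma = byTag.get(posTag);
        return lemma == null
                ? Collections.<String>emptyList()
                : Collections.singletonList(lemma);
    }

    public static void main(String[] args) {
        System.out.println(getLemmas("is", ""));    // both candidates
        System.out.println(getLemmas("is", "VBZ")); // only the verb reading
    }
}
```

With the empty tag, the lookup for "is" returns both candidates, just as in the output above. Supplying a verb tag leaves only the valid one, which is why a POS tagger is normally run before lemmatization.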
A short list of POS tags is found in the following table. This list is adapted from ht-
A complete list of the University of Pennsylvania (Penn) Treebank tag set can be found at ht-
Tag    Description
JJ     Adjective
NN     Noun, singular or mass
NNS    Noun, plural