for (String token : tokens) {
    String[] lemmas = dictionary.getLemmas(token, "");
    for (String lemma : lemmas) {
        System.out.println("Token: " + token + " Lemma: " + lemma);
    }
}
The output is as follows:
Token: Eat, Lemma: at
Token: drink, Lemma: drink
Token: be Lemma: be
Token: life Lemma: life
Token: is Lemma: is
Token: is Lemma: i
Token: a Lemma: a
Token: dream Lemma: dream
The lemmatization process works well except for the token "is", which returns two lemmas.
The second one, "i", is not valid. This illustrates the importance of using the proper POS
for a token. We could have passed a POS tag as the second argument to the getLemmas
method instead of an empty string. However, this raises the question: how do we determine
the correct POS? This topic is discussed in detail in Chapter 5, Detecting Parts of Speech.
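To see why an empty POS argument can produce a spurious lemma such as "i", the following is a minimal toy sketch. It is not the dictionary used in the listing above. The class ToyLemmatizer, its suffix rules, and the tags used as keys are all hypothetical and chosen only for illustration. The sketch assumes each POS has its own suffix-stripping rules, and that an empty POS string means "try the rules for every POS".

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Toy illustration only: hypothetical rules, not a real lemmatizer.
public class ToyLemmatizer {
    // Hypothetical suffixes to strip, keyed by POS tag.
    private static final Map<String, String[]> RULES = new LinkedHashMap<>();
    static {
        RULES.put("NN", new String[] {"s"});   // noun rule: dreams -> dream
        RULES.put("VB", new String[] {"ing"}); // verb rule: drinking -> drink
    }

    // Returns candidate lemmas; an empty pos applies the rules of every POS.
    public static String[] getLemmas(String token, String pos) {
        Set<String> lemmas = new LinkedHashSet<>();
        lemmas.add(token); // the unchanged token is always a candidate
        for (Map.Entry<String, String[]> entry : RULES.entrySet()) {
            if (!pos.isEmpty() && !entry.getKey().equals(pos)) {
                continue; // skip rules that belong to a different POS
            }
            for (String suffix : entry.getValue()) {
                if (token.endsWith(suffix) && token.length() > suffix.length()) {
                    lemmas.add(token.substring(0, token.length() - suffix.length()));
                }
            }
        }
        return lemmas.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // No POS given: the noun rule strips the "s" from "is".
        System.out.println(String.join(", ", getLemmas("is", "")));   // is, i
        // Verb POS given: the noun rule is never applied.
        System.out.println(String.join(", ", getLemmas("is", "VB"))); // is
    }
}
```

In this sketch, restricting the lookup to one POS is what removes the invalid candidate. A real dictionary has far richer rules and exception lists, so its output for a given token and tag may differ.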
A short list of POS tags is found in the following table. This list is adapted from
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html. The
complete list of the University of Pennsylvania (Penn) Treebank tag set can be found at
http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html.
Tag    Description
JJ     Adjective
NN     Noun, singular or mass
NNS    Noun, plural