Using lemmatization
Lemmatization is supported by a number of NLP APIs. In this section, we will illustrate how lemmatization can be performed using the StanfordCoreNLP and the OpenNLPLemmatizer classes. The lemmatization process determines the lemma of a word. A lemma can be thought of as the dictionary form of a word. For example, the lemma of "was" is "be".
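To make the idea of a dictionary form concrete, the following is a minimal sketch that uses a hand-built lookup table. The class name ToyLemmatizer and its sample entries are hypothetical and exist only for illustration; this is not how the NLP libraries discussed in this section implement lemmatization.

```java
import java.util.Locale;
import java.util.Map;

// Toy illustration only: a hand-written lookup table, not a real lemmatizer.
public class ToyLemmatizer {
    // Hypothetical sample entries mapping inflected forms to dictionary forms.
    private static final Map<String, String> LEMMAS = Map.of(
            "was", "be",
            "were", "be",
            "is", "be",
            "mice", "mouse",
            "ran", "run");

    // Returns the lemma if the word is in the table, otherwise the lowercased word.
    public static String lemmaOf(String word) {
        String key = word.toLowerCase(Locale.ROOT);
        return LEMMAS.getOrDefault(key, key);
    }

    public static void main(String[] args) {
        System.out.println(lemmaOf("was"));  // prints "be"
        System.out.println(lemmaOf("Mice")); // prints "mouse"
    }
}
```

A fixed table like this cannot cover unseen words or words whose lemma depends on their part of speech, which is one reason the pipeline used in this section runs POS tagging before lemmatization.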
Using the StanfordLemmatizer class
We will use the StanfordCoreNLP class with a pipeline to demonstrate lemmatization. We start by setting up the pipeline with four annotators, including lemma, as shown here:
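// Requires java.util.Properties and edu.stanford.nlp.pipeline.StanfordCoreNLP
Properties props = new Properties();
// lemma runs last; it relies on the tokenize, ssplit, and pos annotators
props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);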
The lemma annotator relies on the three annotators listed before it. These four, along with other commonly used annotators, are explained as follows:
Annotator   Operation to be Performed
tokenize    Tokenization
ssplit      Sentence splitting
pos         POS tagging
lemma       Lemmatization
ner         NER
parse       Syntactic parsing
dcoref      Coreference resolution
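Because lemma builds on the output of the earlier annotators, the order and completeness of the annotators list matter. The following stand-alone sketch checks a list for that property without calling CoreNLP at all. The class name AnnotatorCheck and its prerequisite table are illustrative assumptions covering only the four annotators used in this section; the library performs its own validation when the pipeline is constructed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-alone sketch: checks that each annotator appears after the ones it needs.
public class AnnotatorCheck {
    // Assumed prerequisite table for the four annotators used in this section.
    private static final Map<String, List<String>> REQUIRES = new LinkedHashMap<>();
    static {
        REQUIRES.put("tokenize", List.of());
        REQUIRES.put("ssplit", List.of("tokenize"));
        REQUIRES.put("pos", List.of("tokenize", "ssplit"));
        REQUIRES.put("lemma", List.of("tokenize", "ssplit", "pos"));
    }

    // Returns a list of problems; an empty list means the order is acceptable.
    public static List<String> problems(String annotators) {
        List<String> seen = new ArrayList<>();
        List<String> issues = new ArrayList<>();
        for (String raw : annotators.split(",")) {
            String name = raw.trim();
            if (name.isEmpty()) {
                continue;
            }
            // Unknown annotators have no entry and are treated as having no prerequisites.
            for (String needed : REQUIRES.getOrDefault(name, List.of())) {
                if (!seen.contains(needed)) {
                    issues.add(name + " needs " + needed + " before it");
                }
            }
            seen.add(name);
        }
        return issues;
    }

    public static void main(String[] args) {
        System.out.println(problems("tokenize, ssplit, pos, lemma")); // prints []
        System.out.println(problems("tokenize, lemma"));
    }
}
```

Calling problems("tokenize, lemma") reports that ssplit and pos are missing, which mirrors why the pipeline above lists all four annotators.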
A paragraph variable holding the text is passed to the Annotation constructor, and the pipeline's annotate method is then executed, as shown here: