Java Reference
In-Depth Information
Using stemming
Finding the stem of a word involves removing any prefixes or suffixes; what is left is
considered to be the stem. Identifying stems is useful for tasks where finding similar words
is important. For example, a search may be looking for occurrences of the word "book".
Many words contain it, including books, booked, bookings, and bookmark. It can be useful to
reduce both the search term and the words of a document to their stems and then match on
the stems. In many situations, this can improve the quality of a search.
A stemmer may produce a stem that is not a real word. For example, it may decide that
bounties, bounty, and bountiful all have the same stem, "bounti". This can still be useful for
searches.
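As a rough illustration of stem-based matching, the following sketch reduces each token to
a stem and compares it with the stem of the search term. The class ToyStemSearch and its
short suffix list are assumptions made for this example only; it is not the Porter
algorithm and handles far fewer cases than a real stemmer.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only: strips a handful of English suffixes.
// The class name and suffix list are hypothetical, not part of any library.
final class ToyStemSearch {
    // Suffixes are checked in this order, longest first.
    private static final String[] SUFFIXES = {"ings", "ing", "ed", "s"};

    static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : SUFFIXES) {
            // Keep at least three characters so short words are left alone.
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    // Returns the tokens of text whose stem equals the stem of the query.
    static List<String> search(String query, String text) {
        String target = stem(query);
        List<String> hits = new ArrayList<>();
        for (String token : text.split("\\W+")) {
            if (!token.isEmpty() && stem(token).equals(target)) {
                hits.add(token);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "She booked two books; the bookings are in my bookmark.";
        System.out.println(search("book", text)); // prints [booked, books, bookings]
    }
}
```

Note that "bookmark" is not matched here, because this toy stemmer only strips suffixes from
its list. Whether such compound words should match depends on the search being built.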
Note
Similar to stemming is lemmatization. This is the process of finding a word's lemma, its form
as found in a dictionary. This can also be useful for some searches. Stemming is frequently
viewed as a more primitive technique, where the attempt to get to the "root" of a word
involves cutting off parts of the beginning and/or ending of a token.
Lemmatization can be thought of as a more sophisticated approach where effort is devoted
to finding the morphological or vocabulary meaning of a token. For example, the word
"having" has a stem of "hav" while its lemma is "have". Also, the words "was" and "been"
have different stems but the same lemma, "be".
Lemmatization can often use more computational resources than stemming. They both
have their place and their utility is partially determined by the problem that needs to be
solved.
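The difference can be sketched in a few lines. The example below assumes a tiny hand-written
lemma table and a crude rule that only strips "ing"; both are hypothetical stand-ins. A real
lemmatizer relies on a full dictionary and morphological analysis, which is where the extra
computational cost comes from.

```java
import java.util.Map;

// Hypothetical comparison of a crude stem with a dictionary-style lemma.
final class StemVersusLemma {
    // Hand-written lemma table covering only the words discussed in the text.
    private static final Map<String, String> LEMMAS = Map.of(
            "having", "have",
            "was", "be",
            "been", "be");

    // Looks the word up in the table; unknown words are returned unchanged.
    static String lemma(String word) {
        String w = word.toLowerCase();
        return LEMMAS.getOrDefault(w, w);
    }

    // Crude stemming rule: cut off a trailing "ing" from words longer than five letters.
    static String crudeStem(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("ing") && w.length() > 5) {
            return w.substring(0, w.length() - 3);
        }
        return w;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"having", "was", "been"}) {
            System.out.println(word + ": stem=" + crudeStem(word) + ", lemma=" + lemma(word));
        }
    }
}
```

With these assumptions, "having" yields the stem "hav" and the lemma "have", while "was" and
"been" keep different stems but share the lemma "be".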
Using the Porter Stemmer
The Porter Stemmer is a commonly used stemmer for English. Its home page can be
found at http://tartarus.org/martin/PorterStemmer/. It uses five steps to stem a word.
Although Apache OpenNLP 1.5.3 does not contain the PorterStemmer class, its source
code can be downloaded from
https://svn.apache.org/repos/asf/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/stemmer/PorterStemmer.java
and then added to your project.
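To give a feel for what one of the five steps does, here is a minimal sketch of the first
sub-step of the published algorithm (usually labeled step 1a), which handles plural endings.
The class name PorterStep1a is an assumption for this example; it is not the OpenNLP
PorterStemmer class, and the remaining steps are omitted.

```java
// Sketch of Porter step 1a only: SSES -> SS, IES -> I, SS -> SS, S -> (nothing).
// Hypothetical helper class; a complete stemmer applies several more steps after this.
final class PorterStep1a {
    static String apply(String word) {
        if (word.endsWith("sses")) {
            return word.substring(0, word.length() - 2); // caresses -> caress
        }
        if (word.endsWith("ies")) {
            return word.substring(0, word.length() - 2); // bounties -> bounti
        }
        if (word.endsWith("ss")) {
            return word;                                 // caress -> caress
        }
        if (word.endsWith("s")) {
            return word.substring(0, word.length() - 1); // cats -> cat
        }
        return word;                                     // no plural ending found
    }

    public static void main(String[] args) {
        for (String word : new String[] {"caresses", "bounties", "caress", "cats"}) {
            System.out.println(word + " -> " + apply(word));
        }
    }
}
```

The rules are tried longest suffix first, so "caress" is left alone by the SS rule instead
of losing its final letter to the S rule.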