Part-of-Speech (POS) Tagging, Lemmatization, and Stemming
The goal of POS tagging is to build a model whose input is a sentence, such as:
he saw a fox
and whose output is a tag sequence. Each tag marks the POS for the
corresponding word, such as:
PRP VBD DT NN
according to the Penn Treebank POS tags [3]. That is, the four words are
mapped to pronoun (personal), verb (past tense), determiner, and noun
(singular), respectively.
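To make the input and output of such a model concrete, here is a minimal sketch of a most-frequent-tag baseline tagger. The training sentences, the function names, and the fallback tag NN for unseen words are all assumptions made for illustration; a practical tagger would be trained on a large annotated corpus and would use the surrounding context.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data: (word, Penn Treebank tag) pairs.
TAGGED_SENTENCES = [
    [("he", "PRP"), ("saw", "VBD"), ("a", "DT"), ("dog", "NN")],
    [("she", "PRP"), ("bought", "VBD"), ("a", "DT"), ("saw", "NN")],
    [("he", "PRP"), ("saw", "VBD"), ("the", "DT"), ("fox", "NN")],
]


def train_unigram_tagger(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_sentence(model, sentence, default_tag="NN"):
    """Tag each whitespace-separated word; unseen words get default_tag (an assumption)."""
    return [model.get(word.lower(), default_tag) for word in sentence.split()]


model = train_unigram_tagger(TAGGED_SENTENCES)
print(" ".join(tag_sentence(model, "he saw a fox")))  # PRP VBD DT NN
```

Because this baseline ignores context, it always tags "saw" as VBD here, even in a sentence where it is used as a noun.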
Both lemmatization and stemming reduce inflected or variant forms of a word
to a base form. Doing so reduces the number of dimensions and makes it
possible to count more accurately how many times each word appears.
Lemmatization uses a given dictionary to find the correct base form of a
word. For example, given the sentence:
obesity causes many problems
the output of lemmatization would be:
obesity cause many problem
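The dictionary lookup can be sketched as follows. The dictionary contents and the function name are hypothetical and cover only the example sentence; a real lemmatizer relies on a full lexicon and often on the POS tag of each word.

```python
# Hypothetical miniature dictionary mapping inflected forms to base forms.
LEMMA_DICTIONARY = {
    "causes": "cause",
    "caused": "cause",
    "problems": "problem",
}


def lemmatize(sentence, dictionary=LEMMA_DICTIONARY):
    """Replace each word by its dictionary base form; unknown words are kept as-is."""
    words = sentence.split()
    return " ".join(dictionary.get(word.lower(), word) for word in words)


print(lemmatize("obesity causes many problems"))  # obesity cause many problem
```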
Unlike lemmatization, stemming does not need a dictionary. It usually refers
to a crude process that strips affixes according to a set of heuristics, in the
hope of reducing inflections or variant forms to a common base. After the
process, words are reduced to stems. A stem is not necessarily an actual word
in the natural language, but it is sufficient to distinguish the word from
the stems of other words. A well-known rule-based stemming algorithm is
Porter's stemming algorithm. It defines a set of production rules that
iteratively transform words into their stems. For the sentence shown
previously:
obesity causes many problems
the output of Porter's stemming algorithm is:
obes caus mani problem
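The following sketch shows how a few such production rules produce the stems above. It implements only a small subset of Porter's rules (plural stripping, y to i, removal of "iti", and removal of a final "e"), chosen because they are the ones this sentence exercises; it is not the full algorithm, and the function names are hypothetical.

```python
VOWELS = "aeiou"


def is_consonant(word, i):
    """In Porter's definition, y counts as a vowel when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-to-consonant transitions (the m in Porter's [C](VC)^m[V])."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def contains_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_cvc(stem):
    """True if stem ends consonant-vowel-consonant and the last letter is not w, x, or y."""
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3) and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")


def toy_stem(word):
    # Plural rules: sses -> ss, ies -> i, ss -> ss, s -> (nothing).
    if word.endswith("sses") or word.endswith("ies"):
        word = word[:-2]
    elif word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    # Final y -> i when the rest of the word contains a vowel.
    if word.endswith("y") and contains_vowel(word[:-1]):
        word = word[:-1] + "i"
    # One suffix-removal rule: drop "iti" when the remaining stem has m > 1.
    if word.endswith("iti") and measure(word[:-3]) > 1:
        word = word[:-3]
    # Drop a final e when m > 1, or when m == 1 and the stem does not end in cvc.
    if word.endswith("e"):
        stem = word[:-1]
        m = measure(stem)
        if m > 1 or (m == 1 and not ends_cvc(stem)):
            word = stem
    return word


def toy_stem_sentence(sentence):
    return " ".join(toy_stem(word) for word in sentence.lower().split())


print(toy_stem_sentence("obesity causes many problems"))  # obes caus mani problem
```

Each rule fires only when its condition on the remaining stem holds, which is why "obesity" loses "iti" (the stem "obes" has m = 2) while short words are left largely intact.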