Advanced Analytical Theory and Methods: Text Analysis - Data Science and Big Data Analytics

Part-of-Speech (POS) Tagging, Lemmatization, and Stemming

The goal of POS tagging is to build a model whose input is a sentence, such as:

he saw a fox

and whose output is a tag sequence. Each tag marks the POS for the corresponding word, such as:

PRP VBD DT NN

according to the Penn Treebank POS tags [3]. That is, the four words are tagged as personal pronoun, past-tense verb, determiner, and singular noun, respectively.
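The sentence-in, tags-out mapping can be illustrated with a very small model. The following is a minimal sketch, assuming a tiny hand-made tagged corpus (`TAGGED_SENTENCES`) and a unigram approach that assigns each word its most frequent tag. The corpus, the function names, and the default tag for unknown words are all hypothetical choices made for illustration; practical taggers are trained on large corpora and use the surrounding context.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged training data using Penn Treebank tags.
TAGGED_SENTENCES = [
    [("he", "PRP"), ("saw", "VBD"), ("a", "DT"), ("fox", "NN")],
    [("she", "PRP"), ("bought", "VBD"), ("a", "DT"), ("saw", "NN")],
    [("he", "PRP"), ("saw", "VBD"), ("the", "DT"), ("dog", "NN")],
]


def train_unigram_tagger(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_sentence(sentence, model, default_tag="NN"):
    """Tag whitespace-separated tokens; unknown words get default_tag (an assumption)."""
    return [model.get(token, default_tag) for token in sentence.lower().split()]


model = train_unigram_tagger(TAGGED_SENTENCES)
print(" ".join(tag_sentence("he saw a fox", model)))  # PRP VBD DT NN
```

In this toy corpus "saw" appears twice as a verb and once as a noun, so the unigram model always tags it VBD. Resolving such ambiguity from context is what more capable taggers add.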

Both lemmatization and stemming reduce inflected or variant forms of a word to a common base form. This lowers the number of dimensions (distinct terms) and makes it possible to measure more accurately how many times each word appears.

Using a given dictionary, lemmatization finds the correct dictionary base form of a word. For example, given the sentence:

obesity causes many problems

the output of lemmatization would be:

obesity cause many problem
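The dictionary lookup behind this example can be sketched as follows. This is a minimal illustration, assuming a tiny hand-written dictionary (`LEMMA_DICT`, a hypothetical name) that maps inflected forms to their base forms; a real lemmatizer relies on a full lexical resource and often on the word's POS tag as well.

```python
# Hypothetical miniature dictionary: inflected form -> dictionary base form.
LEMMA_DICT = {
    "causes": "cause",
    "caused": "cause",
    "causing": "cause",
    "problems": "problem",
}


def lemmatize(sentence, lemma_dict=LEMMA_DICT):
    """Replace each token with its base form; tokens not in the dictionary are kept."""
    tokens = sentence.lower().split()
    return " ".join(lemma_dict.get(token, token) for token in tokens)


print(lemmatize("obesity causes many problems"))  # obesity cause many problem
```

Words that are already in base form, such as "obesity" and "many", pass through unchanged because the lookup falls back to the token itself.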

Unlike lemmatization, stemming does not need a dictionary. It usually refers to a crude process of stripping affixes according to a set of heuristics, in the hope of correctly reducing inflections or variant forms. After this process, words are stripped down to stems. A stem is not necessarily an actual word in the natural language, but it is sufficient to distinguish the word from the stems of other words. A well-known rule-based stemming algorithm is Porter's stemming algorithm. It defines a set of production rules that iteratively transform words into their stems. For the sentence shown previously:

obesity causes many problems

the output of Porter's stemming algorithm is:

obes caus mani problem
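The affix-stripping idea can be sketched with a handful of rules. The following is a minimal sketch and not Porter's full algorithm: it implements only plural stripping (modeled on Porter's step 1a), the final y-to-i rule (step 1c), removal of the suffix "iti" (one rule from step 4), and removal of a final "e" (step 5a). All other steps are omitted, so results for other words may differ from a complete Porter stemmer. The function names (`toy_stem`, `measure`, and so on) are hypothetical.

```python
VOWELS = "aeiou"


def is_consonant(word, i):
    """Porter's definition: 'y' is a vowel only when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-run to consonant-run transitions (m in [C](VC)^m[V])."""
    m = 0
    prev_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_was_vowel:
            m += 1
        prev_was_vowel = not consonant
    return m


def has_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_cvc(stem):
    """True if stem ends consonant-vowel-consonant and the last letter is not w, x, or y."""
    n = len(stem)
    if n < 3:
        return False
    return (is_consonant(stem, n - 3) and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")


def toy_stem(word):
    """Apply a small subset of Porter-style rules (an illustrative assumption)."""
    w = word.lower()
    # Plural endings: sses -> ss, ies -> i, ss -> ss, s -> "".
    if w.endswith("sses") or w.endswith("ies"):
        w = w[:-2]
    elif w.endswith("s") and not w.endswith("ss"):
        w = w[:-1]
    # Final y -> i when the rest of the word contains a vowel.
    if w.endswith("y") and has_vowel(w[:-1]):
        w = w[:-1] + "i"
    # Drop the suffix "iti" when the remaining stem has measure > 1.
    if w.endswith("iti") and measure(w[:-3]) > 1:
        w = w[:-3]
    # Drop a final "e" when the stem is long enough.
    if w.endswith("e"):
        stem = w[:-1]
        m = measure(stem)
        if m > 1 or (m == 1 and not ends_cvc(stem)):
            w = stem
    return w


sentence = "obesity causes many problems"
print(" ".join(toy_stem(token) for token in sentence.split()))
# obes caus mani problem
```

Tracing "obesity" shows the iterative character of the rules: the final "y" becomes "i" (obesiti), and the suffix "iti" is then removed because the remaining stem "obes" has measure 2.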
