The tagging process
Tagging is the process of assigning a description to a token or a portion of text. This
description is called a tag. POS tagging is the process of assigning a part-of-speech (POS)
tag to a token. These tags typically name grammatical categories such as noun, verb, and
adjective.
For example, consider the following sentence:
"The cow jumped over the moon."
For many of these initial examples, we will illustrate the result of a POS tagger using the
OpenNLP tagger, discussed in Using OpenNLP POS taggers, later in this chapter. If we use
that tagger with the previous example, we get the following results. Notice that each word
is followed by a forward slash and then its POS tag. These tags will be explained shortly:
The/DT cow/NN jumped/VBD over/IN the/DT moon./NN
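Output in this word/TAG form is easy to take apart programmatically. The following is a minimal sketch, not part of OpenNLP: the class name TaggedTextParser and its nested TaggedToken type are hypothetical names introduced only for illustration. It assumes that items are separated by whitespace and that the tag follows the last forward slash.

```java
import java.util.ArrayList;
import java.util.List;

public class TaggedTextParser {

    /** A word paired with its POS tag (hypothetical helper type). */
    public static final class TaggedToken {
        public final String word;
        public final String tag;

        TaggedToken(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    /** Splits "word/TAG word/TAG ..." into word and tag pairs. */
    public static List<TaggedToken> parse(String tagged) {
        List<TaggedToken> result = new ArrayList<>();
        String trimmed = tagged.trim();
        if (trimmed.isEmpty()) {
            return result;
        }
        for (String item : trimmed.split("\\s+")) {
            // Split on the LAST slash so a word that itself contains '/' stays intact.
            int slash = item.lastIndexOf('/');
            if (slash <= 0) {
                throw new IllegalArgumentException("Not in word/TAG form: " + item);
            }
            result.add(new TaggedToken(item.substring(0, slash), item.substring(slash + 1)));
        }
        return result;
    }

    public static void main(String[] args) {
        String output = "The/DT cow/NN jumped/VBD over/IN the/DT moon./NN";
        for (TaggedToken token : parse(output)) {
            System.out.println(token.word + " -> " + token.tag);
        }
    }
}
```

Note that the last word comes back as "moon." with the period attached, exactly as it appears in the tagged output above.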
Words can potentially have more than one tag associated with them depending on their
context. For example, the word "saw" could be a noun or a verb. When a word can be
classified into different categories, information such as its position and the words in its
vicinity is used to probabilistically determine the appropriate category. For example, if a
word is preceded by a determiner and followed by a noun, then the word can be tagged as
an adjective.
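The determiner/noun rule just described can be sketched in a few lines of Java. This is an illustration under stated assumptions rather than how any particular tagger works: the class ContextRuleTagger, its toy lexicon, the ordering of candidate tags (most likely first), and the fallback of tagging unknown words as NN are all hypothetical choices made for the example. JJ is the Penn Treebank tag for an adjective.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ContextRuleTagger {

    // Hypothetical toy lexicon: each word maps to its possible tags, most likely first.
    private static final Map<String, List<String>> LEXICON = new HashMap<>();
    static {
        LEXICON.put("the", Arrays.asList("DT"));
        LEXICON.put("light", Arrays.asList("NN", "JJ", "VB"));
        LEXICON.put("box", Arrays.asList("NN"));
        LEXICON.put("was", Arrays.asList("VBD"));
        LEXICON.put("on", Arrays.asList("IN"));
    }

    public static List<String> tag(String[] tokens) {
        // Look up the candidate tags for every token (unknown words default to NN).
        List<List<String>> candidates = new ArrayList<>();
        for (String token : tokens) {
            candidates.add(LEXICON.getOrDefault(token.toLowerCase(), Arrays.asList("NN")));
        }

        List<String> tags = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            List<String> options = candidates.get(i);
            String chosen = options.get(0); // default: first listed tag

            // Context rule: an ambiguous word that may be an adjective, preceded by a
            // determiner and followed by a word whose first candidate is a noun, is
            // tagged as an adjective.
            boolean ambiguous = options.size() > 1 && options.contains("JJ");
            boolean afterDeterminer = i > 0 && "DT".equals(tags.get(i - 1));
            boolean beforeNoun = i + 1 < tokens.length
                    && "NN".equals(candidates.get(i + 1).get(0));
            if (ambiguous && afterDeterminer && beforeNoun) {
                chosen = "JJ";
            }
            tags.add(chosen);
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(tag(new String[] {"The", "light", "box"}));       // [DT, JJ, NN]
        System.out.println(tag(new String[] {"The", "light", "was", "on"})); // [DT, NN, VBD, IN]
    }
}
```

In the first sentence "light" sits between a determiner and a noun, so the rule fires. In the second it is followed by a verb, so the default tag is kept.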
The general tagging process consists of tokenizing the text, determining possible tags, and
resolving ambiguous tags. Algorithms are used to perform POS identification (tagging).
There are two general approaches:
Rule-based: Rule-based taggers use a set of rules and a dictionary of words and
possible tags. The rules are applied when a word has multiple possible tags. Rules
often use the previous and/or following words to select a tag.
Stochastic: Stochastic taggers are either based on the Markov model or are cue-based,
using either decision trees or maximum entropy. Markov models are finite state
machines where each state has two probability distributions. Their objective is to
find the optimal sequence of tags for a sentence. Hidden Markov Models (HMM) are
also used. In these models, the states (here, the tags) are not directly visible;
only the words they emit are observed.
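To make "finding the optimal sequence of tags" concrete, the following is a minimal sketch of the Viterbi algorithm, the standard dynamic-programming method for finding the most probable state sequence in an HMM. It is not taken from any library. The class name ViterbiTagger, the three-tag tag set, and every probability in the tables are hypothetical values invented for the example; a real tagger would estimate them from a tagged corpus and would smooth the probabilities of unseen words, which this sketch does not do.

```java
import java.util.HashMap;
import java.util.Map;

public class ViterbiTagger {

    // All numbers below are made-up toy values, not estimates from real data.
    static final String[] TAGS = {"DT", "NN", "VBD"};

    // P(first tag)
    static final double[] START = {0.6, 0.3, 0.1};

    // TRANS[p][s] = P(tag s follows tag p)
    static final double[][] TRANS = {
        {0.05, 0.90, 0.05}, // after DT
        {0.10, 0.20, 0.70}, // after NN
        {0.60, 0.30, 0.10}  // after VBD
    };

    // EMIT.get(word)[s] = P(word | tag s)
    static final Map<String, double[]> EMIT = new HashMap<>();
    static {
        EMIT.put("the",    new double[] {1.0, 0.0, 0.0});
        EMIT.put("cow",    new double[] {0.0, 0.5, 0.0});
        EMIT.put("saw",    new double[] {0.0, 0.5, 0.8});
        EMIT.put("jumped", new double[] {0.0, 0.0, 0.2});
    }

    private static double emit(String word, int tag) {
        double[] probs = EMIT.get(word.toLowerCase());
        return probs == null ? 0.0 : probs[tag]; // unseen words get probability 0
    }

    /** Returns the most probable tag sequence for the given words. */
    public static String[] tag(String[] words) {
        int n = words.length;
        int k = TAGS.length;
        if (n == 0) {
            return new String[0];
        }
        double[][] score = new double[n][k]; // best log-probability ending in tag s at position t
        int[][] back = new int[n][k];        // previous tag on that best path

        for (int s = 0; s < k; s++) {
            score[0][s] = Math.log(START[s]) + Math.log(emit(words[0], s));
        }
        for (int t = 1; t < n; t++) {
            for (int s = 0; s < k; s++) {
                int bestPrev = 0;
                double best = Double.NEGATIVE_INFINITY;
                for (int p = 0; p < k; p++) {
                    double candidate = score[t - 1][p] + Math.log(TRANS[p][s]);
                    if (candidate > best) {
                        best = candidate;
                        bestPrev = p;
                    }
                }
                score[t][s] = best + Math.log(emit(words[t], s));
                back[t][s] = bestPrev;
            }
        }

        // Pick the best final tag, then follow the back-pointers to the start.
        int last = 0;
        for (int s = 1; s < k; s++) {
            if (score[n - 1][s] > score[n - 1][last]) {
                last = s;
            }
        }
        String[] result = new String[n];
        for (int t = n - 1; t >= 0; t--) {
            result[t] = TAGS[last];
            last = back[t][last];
        }
        return result;
    }

    public static void main(String[] args) {
        String[] words = "the cow saw the saw".split(" ");
        System.out.println(String.join(" ", tag(words))); // DT NN VBD DT NN
    }
}
```

With these toy tables, the first "saw" is tagged as a verb because a verb is likely after a noun, while the second is tagged as a noun because a noun is likely after a determiner. Log-probabilities are used so that long sentences do not underflow to zero.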