A maximum entropy tagger uses statistics to determine the part of speech (POS) of a word
and is usually trained on a corpus. In this context, a corpus is a collection of words marked
up with POS tags. Corpora exist for a number of languages, and they take a lot of effort to
develop. Frequently used corpora include the Penn Treebank
(http://www.cis.upenn.edu/~treebank/) and the Brown Corpus.
A sample from the Penn Treebank corpus, which illustrates POS markup, is as follows:
Well/UH what/WP do/VBP you/PRP think/VB about/IN
the/DT idea/NN of/IN ,/, uh/UH ,/, kids/NNS having/VBG
to/TO do/VB public/JJ service/NN work/NN for/IN a/DT
year/NN ?/.
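The word/TAG markup in the sample above can be split into word and tag pairs with a few lines of Java. The following is a minimal sketch, not part of any tagging library: the class name TaggedTokenParser and the helper type TaggedWord are hypothetical names introduced for illustration. It assumes tokens are separated by whitespace and that the tag follows the last slash in each token, so that a token such as ,/, is read as the word "," with the tag ",".

```java
import java.util.ArrayList;
import java.util.List;

public class TaggedTokenParser {

    /** A word paired with its POS tag (hypothetical helper type). */
    public static final class TaggedWord {
        public final String word;
        public final String tag;

        TaggedWord(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }

        @Override
        public String toString() {
            return word + " -> " + tag;
        }
    }

    /** Splits whitespace-separated word/TAG tokens into TaggedWord pairs. */
    public static List<TaggedWord> parse(String text) {
        List<TaggedWord> result = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            // Assumption: the tag is everything after the LAST slash.
            int slash = token.lastIndexOf('/');
            if (slash <= 0 || slash == token.length() - 1) {
                continue; // skip tokens that lack a word or a tag
            }
            result.add(new TaggedWord(token.substring(0, slash),
                                      token.substring(slash + 1)));
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "Well/UH what/WP do/VBP you/PRP think/VB about/IN";
        for (TaggedWord taggedWord : parse(sample)) {
            System.out.println(taggedWord);
        }
    }
}
```

Splitting on the last slash rather than the first is a design choice made here so that words that themselves contain a slash are still paired with the correct tag.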
There are traditionally nine parts of speech in English: noun, verb, article, adjective,
preposition, pronoun, adverb, conjunction, and interjection. However, a more complete
analysis often requires additional categories and subcategories. As many as 150 different
parts of speech have been identified, and in some situations it may be necessary to create
new tags. The following table gives a short list of the tags used frequently in this chapter:
Tag    Meaning
NN     Noun, singular or mass
DT     Determiner
VB     Verb, base form
VBD    Verb, past tense
VBZ    Verb, third person singular present
IN     Preposition or subordinating conjunction
NNP    Proper noun, singular
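When inspecting tagger output, it can help to translate tags back into their meanings. The sketch below holds the tags listed above in a simple lookup map. The class name PosTagGlossary and the method describe are hypothetical names chosen for this illustration, and the map covers only these seven tags rather than the full Penn Treebank tag set.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PosTagGlossary {

    // Only the tags listed in this section; not the complete tag set.
    private static final Map<String, String> TAGS = new LinkedHashMap<>();

    static {
        TAGS.put("NN", "Noun, singular or mass");
        TAGS.put("DT", "Determiner");
        TAGS.put("VB", "Verb, base form");
        TAGS.put("VBD", "Verb, past tense");
        TAGS.put("VBZ", "Verb, third person singular present");
        TAGS.put("IN", "Preposition or subordinating conjunction");
        TAGS.put("NNP", "Proper noun, singular");
    }

    /** Returns the meaning of a tag, or a placeholder for tags not in the map. */
    public static String describe(String tag) {
        return TAGS.getOrDefault(tag, "Unknown tag: " + tag);
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"DT", "NN", "VBZ", "XYZ"}) {
            System.out.println(tag + " = " + describe(tag));
        }
    }
}
```

A LinkedHashMap is used here only so that iterating over the map follows the order in which the tags were listed.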