A maximum entropy tagger uses statistics to determine the part of speech (POS) for a word
and is usually trained on a corpus. For this purpose, a corpus is a collection of text in
which each word is marked up with its POS tag. Corpora exist for a number of languages, and
they take a lot of effort to develop. Frequently used corpora include the Penn Treebank
( http://www.cis.upenn.edu/~treebank/ ) and the Brown Corpus
( http://www.essex.ac.uk/linguistics/external/clmt/w3c/corpus_ling/content/corpora/list/private/brown/brown.html ).
A sample from the Penn Treebank corpus, which illustrates POS markup, is as follows:
Well/UH what/WP do/VBP you/PRP think/VB about/IN
the/DT idea/NN of/IN ,/, uh/UH ,/, kids/NNS having/VBG
to/TO do/VB public/JJ service/NN work/NN for/IN a/DT
year/NN ?/.
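The word/TAG markup in the sample is easy to read programmatically. The following is a minimal sketch, not code from any tagging library: the class TaggedTokenParser and its nested TaggedToken type are hypothetical names introduced only for illustration. It assumes tokens are separated by whitespace and that the tag follows the last slash in each token, which also handles punctuation entries such as ,/, and ?/.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of any NLP library.
final class TaggedTokenParser {

    // Simple value holder for one word and its POS tag.
    static final class TaggedToken {
        final String word;
        final String tag;

        TaggedToken(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }

        @Override
        public String toString() {
            return word + "/" + tag;
        }
    }

    // Splits whitespace-separated word/TAG items into word and tag parts.
    // The tag is taken to be everything after the last slash (an assumption).
    static List<TaggedToken> parse(String text) {
        List<TaggedToken> tokens = new ArrayList<>();
        for (String item : text.trim().split("\\s+")) {
            if (item.isEmpty()) {
                continue; // empty input yields a single empty item
            }
            int slash = item.lastIndexOf('/');
            if (slash <= 0 || slash == item.length() - 1) {
                throw new IllegalArgumentException("Not in word/TAG form: " + item);
            }
            tokens.add(new TaggedToken(item.substring(0, slash),
                                       item.substring(slash + 1)));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sample = "Well/UH what/WP do/VBP you/PRP think/VB about/IN "
                + "the/DT idea/NN of/IN ,/, uh/UH ,/, kids/NNS having/VBG "
                + "to/TO do/VB public/JJ service/NN work/NN for/IN a/DT "
                + "year/NN ?/.";
        for (TaggedToken token : parse(sample)) {
            System.out.println(token.word + " -> " + token.tag);
        }
    }
}
```

Splitting on the last slash rather than the first is a design choice that keeps a word containing a slash intact, at the cost of assuming that tags themselves never contain one.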
There are traditionally nine parts of speech in English: noun, verb, article, adjective,
preposition, pronoun, adverb, conjunction, and interjection. However, a more complete
analysis often requires additional categories and subcategories. As many as 150 different
parts of speech have been identified. In some situations, it may be necessary to create
new tags. The following table lists the tags we use frequently in this chapter:
Tag    Meaning
NN     Noun, singular or mass
DT     Determiner
VB     Verb, base form
VBD    Verb, past tense
VBZ    Verb, third person singular present
IN     Preposition or subordinating conjunction
NNP    Proper noun, singular
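When inspecting tagger output in Java, it can be convenient to translate a tag back into its meaning. The following is a minimal sketch under that assumption: PosTagGlossary is a hypothetical class introduced here for illustration, and it covers only the seven tags listed above, returning a placeholder message for anything else.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical lookup table for illustration; covers only the tags listed above.
final class PosTagGlossary {

    private static final Map<String, String> MEANINGS;

    static {
        // LinkedHashMap keeps the same order as the table.
        Map<String, String> m = new LinkedHashMap<>();
        m.put("NN", "Noun, singular or mass");
        m.put("DT", "Determiner");
        m.put("VB", "Verb, base form");
        m.put("VBD", "Verb, past tense");
        m.put("VBZ", "Verb, third person singular present");
        m.put("IN", "Preposition or subordinating conjunction");
        m.put("NNP", "Proper noun, singular");
        MEANINGS = Collections.unmodifiableMap(m);
    }

    // Returns the meaning of a tag, or a placeholder for tags not in the table.
    static String describe(String tag) {
        return MEANINGS.getOrDefault(tag, "Unknown tag: " + tag);
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"DT", "NN", "VBZ", "UH"}) {
            System.out.println(tag + " = " + describe(tag));
        }
    }
}
```

A real tag set such as the Penn Treebank's is considerably larger, so a lookup like this would need to be extended before being used on actual tagger output.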