Using Stanford POS taggers
In this section, we will examine two approaches to tagging that the Stanford API
supports. The first uses the MaxentTagger class. As its name implies, it uses
maximum entropy to determine the part of speech (POS). We will also use this class to
demonstrate a model designed to handle textese, the abbreviated style found in text
messages. The second uses the pipeline approach with annotators. The English taggers
use the Penn Treebank English POS tag set.
Using Stanford MaxentTagger
The MaxentTagger class uses a model to perform the tagging task. A number of models
come bundled with the API, all with the file extension .tagger. They include English,
Chinese, Arabic, French, and German models. The English models are listed here. The
prefix wsj refers to models based on the Wall Street Journal. The other terms refer to
techniques used to train the model. These concepts are not covered here:
wsj-0-18-bidirectional-distsim.tagger
wsj-0-18-bidirectional-nodistsim.tagger
wsj-0-18-caseless-left3words-distsim.tagger
wsj-0-18-left3words-distsim.tagger
wsj-0-18-left3words-nodistsim.tagger
english-bidirectional-distsim.tagger
english-caseless-left3words-distsim.tagger
english-left3words-distsim.tagger
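The naming pattern in the list above can be picked apart mechanically. The following is a minimal sketch in plain Java, not part of the Stanford API: the class TaggerModelName and its fields are hypothetical names introduced only for illustration. It splits a model file name on hyphens and records which terms are present, without interpreting what the training techniques mean.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not a Stanford class): records which naming terms
// appear in a bundled model file name such as those listed above.
public class TaggerModelName {
    public final String fileName;
    public final boolean wsj;          // true when the name starts with the wsj prefix
    public final boolean caseless;     // true when the "caseless" term is present
    public final boolean distsim;      // true for "distsim", false for "nodistsim"
    public final String architecture;  // "bidirectional", "left3words", or "unknown"

    public TaggerModelName(String fileName) {
        if (!fileName.endsWith(".tagger")) {
            throw new IllegalArgumentException("Not a .tagger file: " + fileName);
        }
        this.fileName = fileName;
        // Drop the extension, then split the remaining name into its terms.
        String base = fileName.substring(0, fileName.length() - ".tagger".length());
        List<String> parts = Arrays.asList(base.split("-"));
        this.wsj = parts.get(0).equals("wsj");
        this.caseless = parts.contains("caseless");
        // Exact term match, so "nodistsim" does not count as "distsim".
        this.distsim = parts.contains("distsim");
        if (parts.contains("bidirectional")) {
            this.architecture = "bidirectional";
        } else if (parts.contains("left3words")) {
            this.architecture = "left3words";
        } else {
            this.architecture = "unknown";
        }
    }

    public static void main(String[] args) {
        TaggerModelName m =
            new TaggerModelName("wsj-0-18-caseless-left3words-distsim.tagger");
        System.out.println(m.fileName + ": wsj=" + m.wsj
            + ", caseless=" + m.caseless
            + ", distsim=" + m.distsim
            + ", architecture=" + m.architecture);
    }
}
```

A helper like this could be used to choose a model by property, for example selecting a caseless model when the input text has unreliable capitalization.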
The example reads in a series of sentences from a file. Each sentence is then processed and
various ways of accessing and displaying the words and tags are illustrated.
We start with a try-with-resources block, which closes the input automatically and
lets us deal with IO exceptions. The wsj-0-18-bidirectional-distsim.tagger file is
used to create an instance of the MaxentTagger class.
A List of List instances of HasWord objects is created using the MaxentTagger
class's tokenizeText method. The sentences are read in from the file sentences.txt.
The HasWord interface represents words and contains two methods: a setWord and a
word method. The latter method returns a word as a string. Each sentence is
represented by a List instance of HasWord objects.
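The shape of that data structure, and the try-with-resources pattern around it, can be sketched in plain Java. This is only an illustration under stated assumptions. So that it compiles without the Stanford jars, it re-declares a stand-in HasWord interface with the two methods named above, and its tokenizeText is a naive whitespace splitter that treats each line as one sentence. SentenceListSketch, SimpleWord, and readSentences are hypothetical names. With the Stanford library on the classpath, MaxentTagger.tokenizeText would take the place of the stand-in.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class SentenceListSketch {

    // Stand-in for the HasWord interface described in the text
    // (assumption: re-declared here so the sketch needs no Stanford jars).
    public interface HasWord {
        void setWord(String word);
        String word();
    }

    // Hypothetical minimal implementation of the stand-in interface.
    public static class SimpleWord implements HasWord {
        private String word;
        public SimpleWord(String word) { this.word = word; }
        @Override public void setWord(String word) { this.word = word; }
        @Override public String word() { return word; }
    }

    // Naive stand-in for tokenizeText: one sentence per non-empty line,
    // tokens separated by whitespace. The real tokenizer is far more capable.
    public static List<List<HasWord>> tokenizeText(Reader reader) throws IOException {
        List<List<HasWord>> sentences = new ArrayList<>();
        BufferedReader buffered = new BufferedReader(reader);
        String line;
        while ((line = buffered.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            }
            List<HasWord> sentence = new ArrayList<>();
            for (String token : line.split("\\s+")) {
                sentence.add(new SimpleWord(token));
            }
            sentences.add(sentence);
        }
        return sentences;
    }

    // try-with-resources closes the reader even if tokenizing fails.
    public static List<List<HasWord>> readSentences(Reader source) {
        try (Reader reader = source) {
            return tokenizeText(reader);
        } catch (IOException ex) {
            System.err.println("Could not read sentences: " + ex.getMessage());
            return new ArrayList<>();
        }
    }

    public static void main(String[] args) {
        // The chapter's example reads sentences.txt; a StringReader with
        // made-up text is used here so the sketch runs without that file.
        Reader source = new StringReader("The voyage was long .\nWe arrived at dawn .\n");
        for (List<HasWord> sentence : readSentences(source)) {
            StringBuilder sb = new StringBuilder();
            for (HasWord word : sentence) {
                sb.append(word.word()).append(' ');
            }
            System.out.println(sb.toString().trim());
        }
    }
}
```

The outer list holds one entry per sentence and each inner list holds that sentence's words in order, so sentence boundaries survive tokenization and each sentence can be handed to the tagger separately.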