Using Stanford POS taggers
In this section, we will examine two approaches supported by the Stanford API for performing tagging. The first uses the MaxentTagger class. As its name implies, it uses maximum entropy to find the POS. We will also use this class to demonstrate a model designed to handle textese-type text. The second uses a pipeline with annotators. The English taggers use the Penn Treebank English POS tag set.
Using Stanford MaxentTagger
The MaxentTagger class uses a model to perform the tagging task. A number of models come bundled with the API, all with the file extension .tagger. They include English, Chinese, Arabic, French, and German models. The English models are listed below. The prefix wsj refers to models based on the Wall Street Journal. The other terms refer to techniques used to train the model, which are not covered here:
• wsj-0-18-bidirectional-distsim.tagger
• wsj-0-18-bidirectional-nodistsim.tagger
• wsj-0-18-caseless-left3words-distsim.tagger
• wsj-0-18-left3words-distsim.tagger
• wsj-0-18-left3words-nodistsim.tagger
• english-bidirectional-distsim.tagger
• english-caseless-left3words-distsim.tagger
• english-left3words-distsim.tagger
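The naming pattern in the list above can be read off mechanically. The following is a minimal sketch, using only the Java standard library, that splits a model file name into its dash-separated terms and checks for the wsj prefix or for a given term such as caseless. The class TaggerModelName and its methods are hypothetical helpers written for illustration. They are not part of the Stanford API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the Stanford API.
class TaggerModelName {
    private static final String EXTENSION = ".tagger";
    private final List<String> terms;

    TaggerModelName(String fileName) {
        // Drop the ".tagger" extension if present, then split on dashes.
        String base = fileName.endsWith(EXTENSION)
                ? fileName.substring(0, fileName.length() - EXTENSION.length())
                : fileName;
        this.terms = Arrays.asList(base.split("-"));
    }

    // True when the name starts with the "wsj" prefix.
    boolean isWsj() {
        return terms.get(0).equals("wsj");
    }

    // True when the given term appears as a whole dash-separated component,
    // so "distsim" does not match a name that only contains "nodistsim".
    boolean hasTerm(String term) {
        return terms.contains(term);
    }

    public static void main(String[] args) {
        String[] names = {
            "wsj-0-18-caseless-left3words-distsim.tagger",
            "english-bidirectional-distsim.tagger"
        };
        for (String name : names) {
            TaggerModelName model = new TaggerModelName(name);
            System.out.println(name
                    + " wsj=" + model.isWsj()
                    + " caseless=" + model.hasTerm("caseless"));
        }
    }
}
```

Matching whole components rather than substrings keeps distsim and nodistsim apart, which matters because both appear in the list.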
The example reads in a series of sentences from a file. Each sentence is then processed and
various ways of accessing and displaying the words and tags are illustrated.
We start with a try-with-resources block to deal with IO exceptions. The wsj-0-18-bidirectional-distsim.tagger file is used to create an instance of the MaxentTagger class.
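The try-with-resources pattern mentioned above can be sketched with the standard library alone. This is an illustrative sketch rather than the original listing: the class SentenceSource and the method readLines are hypothetical names, and a StringReader stands in for the reader over the input file so that the example runs without any files or the Stanford library. In the real example the tagger would be created inside such a block from the model file path.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only.
class SentenceSource {
    // Reads all lines from the given reader and closes it afterwards.
    static List<String> readLines(Reader source) {
        List<String> lines = new ArrayList<>();
        // The reader is closed automatically when the try block exits,
        // whether normally or because of an exception.
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException ex) {
            // Assumption: rethrow unchecked; real code might log and recover.
            throw new UncheckedIOException(ex);
        }
        return lines;
    }

    public static void main(String[] args) {
        // A StringReader stands in for a file reader in this sketch.
        Reader source = new StringReader("The cat sat.\nIt purred.");
        for (String line : readLines(source)) {
            System.out.println(line);
        }
    }
}
```

Accepting a Reader rather than a file name keeps the method easy to exercise with in-memory text, while a file-backed reader can be passed in unchanged.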
A List of List instances of HasWord objects is created using the MaxentTagger class's tokenizeText method. The sentences are read in from the file sentences.txt. The HasWord interface represents words and contains two methods: setWord and word. The latter returns the word as a string. Each sentence is represented by a List of HasWord objects.
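The nested structure described above, an outer list of sentences where each sentence is a list of word objects, can be illustrated without the Stanford library. In the sketch below, WordHolder is a hypothetical stand-in that mirrors the two methods the text attributes to HasWord, and toSentences uses a naive whitespace split as an assumption. It does not reproduce the tokenization that tokenizeText performs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only; none of these types belong to the Stanford API.
class NestedSentenceDemo {
    // Stand-in mirroring the word and setWord methods described for HasWord.
    interface WordHolder {
        String word();
        void setWord(String word);
    }

    static class SimpleWord implements WordHolder {
        private String word;

        SimpleWord(String word) {
            this.word = word;
        }

        public String word() {
            return word;
        }

        public void setWord(String word) {
            this.word = word;
        }
    }

    // Builds one inner list per input line, splitting naively on whitespace.
    static List<List<WordHolder>> toSentences(List<String> lines) {
        List<List<WordHolder>> sentences = new ArrayList<>();
        for (String line : lines) {
            List<WordHolder> sentence = new ArrayList<>();
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    sentence.add(new SimpleWord(token));
                }
            }
            sentences.add(sentence);
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<List<WordHolder>> sentences =
                toSentences(Arrays.asList("The cat sat .", "It purred ."));
        // Outer loop walks sentences; inner loop walks the words of each one.
        for (List<WordHolder> sentence : sentences) {
            StringBuilder text = new StringBuilder();
            for (WordHolder word : sentence) {
                text.append(word.word()).append(' ');
            }
            System.out.println(text.toString().trim());
        }
    }
}
```

The same two-level loop applies to the real list of HasWord lists: iterate over sentences first, then call word on each element to get its text.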