Summary
POS tagging is a powerful technique for identifying the grammatical parts of a sentence. It
provides useful processing for downstream tasks such as question analysis and analyzing
the sentiment of text. We will return to this subject when we address parsing in Chapter 7,
Using a Parser to Extract Relationships.
Tagging is not an easy process due to the ambiguities found in most languages. The
increasing use of textese only makes the process more difficult. Fortunately, there are models
that do a good job of tagging this type of text. However, as new terms and slang are
introduced, these models need to be kept up to date.
We investigated the use of OpenNLP, the Stanford API, and LingPipe in support of tagging.
These libraries take several different approaches to tagging words, including both
rule-based and model-based approaches. We saw how dictionaries can be used to enhance the
tagging process.
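
As a reminder of how a dictionary can be combined with simple rules, the following is a
minimal sketch rather than code from any of the libraries above. The class
DictionaryRuleTagger, its methods, and the suffix rules are hypothetical and chosen only
for illustration; the tags follow the Penn Treebank naming convention.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only: this is not the OpenNLP, Stanford, or LingPipe API.
class DictionaryRuleTagger {
    // Words whose tag is fixed by the dictionary, stored in lowercase.
    private final Map<String, String> dictionary = new HashMap<>();

    // Register a word together with the tag it should always receive.
    public void addEntry(String word, String tag) {
        dictionary.put(word.toLowerCase(Locale.ROOT), tag);
    }

    // Tag one token: the dictionary is consulted first, then crude suffix rules.
    public String tag(String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        String known = dictionary.get(lower);
        if (known != null) {
            return known;
        }
        if (!lower.isEmpty() && Character.isDigit(lower.charAt(0))) {
            return "CD";   // cardinal number
        }
        if (lower.endsWith("ing")) {
            return "VBG";  // gerund or present participle
        }
        if (lower.endsWith("ed")) {
            return "VBD";  // past tense verb
        }
        if (lower.endsWith("ly")) {
            return "RB";   // adverb
        }
        if (lower.endsWith("s")) {
            return "NNS";  // plural noun
        }
        return "NN";       // default: singular noun
    }

    // Tag every token of an already tokenized sentence.
    public List<String> tag(String[] tokens) {
        List<String> tags = new ArrayList<>();
        for (String token : tokens) {
            tags.add(tag(token));
        }
        return tags;
    }

    public static void main(String[] args) {
        DictionaryRuleTagger tagger = new DictionaryRuleTagger();
        tagger.addEntry("the", "DT");
        tagger.addEntry("gr8", "JJ"); // a textese entry the suffix rules would miss
        String[] tokens = "The dogs barked loudly".split(" ");
        System.out.println(tagger.tag(tokens)); // prints [DT, NNS, VBD, RB]
    }
}
```

The suffix rules are deliberately naive and will mis-tag many words. The point of the
sketch is only that dictionary entries take precedence over the rules, which is one way a
dictionary can correct a tagger on terms it would otherwise get wrong.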
We briefly touched on the model training process. Pretagged sample texts are used as input
to the process and a model emerges as output. Although we did not address validation of
the model, this can be done in much the same way as in earlier chapters.
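
To make the shape of training and validation concrete, here is a minimal sketch that
assumes a most-frequent-tag baseline in place of the statistical models the libraries
actually train. The class BaselineTaggerTrainer, its methods, and the word_TAG input
format are assumptions made for this illustration, not a library API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical baseline: the "model" maps each word to its most frequent training tag.
class BaselineTaggerTrainer {

    // Train on pretagged sentences written as "word_TAG word_TAG ..." (assumed format).
    public static Map<String, String> train(List<String> taggedSentences) {
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (String sentence : taggedSentences) {
            for (String pair : sentence.trim().split("\\s+")) {
                int sep = pair.lastIndexOf('_');
                if (sep <= 0 || sep == pair.length() - 1) {
                    continue; // skip malformed or empty tokens
                }
                String word = pair.substring(0, sep).toLowerCase(Locale.ROOT);
                String tag = pair.substring(sep + 1);
                counts.computeIfAbsent(word, k -> new HashMap<>())
                      .merge(tag, 1, Integer::sum);
            }
        }
        // Keep only the most frequent tag for each word.
        Map<String, String> model = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> entry : counts.entrySet()) {
            String best = null;
            int bestCount = -1;
            for (Map.Entry<String, Integer> tagCount : entry.getValue().entrySet()) {
                if (tagCount.getValue() > bestCount) {
                    best = tagCount.getKey();
                    bestCount = tagCount.getValue();
                }
            }
            model.put(entry.getKey(), best);
        }
        return model;
    }

    // Validate: fraction of held-out tokens whose predicted tag matches the gold tag.
    // Words not seen in training receive defaultTag.
    public static double accuracy(Map<String, String> model,
                                  List<String> heldOutSentences,
                                  String defaultTag) {
        int correct = 0;
        int total = 0;
        for (String sentence : heldOutSentences) {
            for (String pair : sentence.trim().split("\\s+")) {
                int sep = pair.lastIndexOf('_');
                if (sep <= 0 || sep == pair.length() - 1) {
                    continue;
                }
                String word = pair.substring(0, sep).toLowerCase(Locale.ROOT);
                String gold = pair.substring(sep + 1);
                if (model.getOrDefault(word, defaultTag).equals(gold)) {
                    correct++;
                }
                total++;
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) {
        Map<String, String> model = train(List.of(
                "the_DT dog_NN runs_VBZ", "the_DT run_NN",
                "dogs_NNS run_VBP", "a_DT run_NN"));
        double score = accuracy(model, List.of("the_DT dog_NN run_VBP cat_NN"), "NN");
        System.out.println("Held-out accuracy: " + score);
    }
}
```

The held-out sentences must not appear in the training data, otherwise the accuracy figure
says little about how the model handles new text. If two tags tie for a word, this sketch
keeps whichever it encounters first, so the choice is arbitrary.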
The various POS tagger approaches can be compared on a number of factors, such as their
accuracy and how fast they run. Although we did not cover these issues here, there are
numerous web resources available. One comparison that examines how fast they run can be
found at http://mattwilkens.com/2008/11/08/evaluating-pos-taggers-speed/.
In the next chapter, we will examine techniques to classify documents based on their
content.