What makes POS difficult?
Several aspects of a language can make POS tagging difficult. Most English words have two
or more tags associated with them, so a dictionary is not always sufficient to determine
a word's POS. For example, the meaning of words like "bill" and "force" depends on their
context. The following sentence uses "force" as both a noun and a verb, and "bill" as both
a proper name and a common noun:
"Bill used the force to force the manger to tear the bill in two."
Using the OpenNLP tagger with this sentence produces the following output:
Bill/NNP used/VBD the/DT force/NN to/TO force/VB the/DT
manger/NN to/TO tear/VB the/DT bill/NN in/IN two./PRP$
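To make the ambiguity visible, the tagged output above can be grouped by word. The following is a minimal sketch, not part of OpenNLP: the class name TagAmbiguity and the method tagsByWord are hypothetical, and the code assumes the output is a whitespace-separated list of word/TAG pairs and that words are compared without regard to case.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; it is not an OpenNLP class.
class TagAmbiguity {

    // Collects the distinct tags seen for each word in "word/TAG" output.
    // Words are lowercased, so "Bill" and "bill" are treated as one entry.
    static Map<String, Set<String>> tagsByWord(String tagged) {
        Map<String, Set<String>> result = new LinkedHashMap<>();
        for (String pair : tagged.trim().split("\\s+")) {
            int slash = pair.lastIndexOf('/');
            if (slash <= 0) {
                continue; // skip tokens that do not have a word/TAG shape
            }
            String word = pair.substring(0, slash).toLowerCase(Locale.ROOT);
            String tag = pair.substring(slash + 1);
            result.computeIfAbsent(word, k -> new LinkedHashSet<>()).add(tag);
        }
        return result;
    }

    public static void main(String[] args) {
        String tagged = "Bill/NNP used/VBD the/DT force/NN to/TO force/VB the/DT "
                + "manger/NN to/TO tear/VB the/DT bill/NN in/IN two./PRP$";
        // Print only the words that received more than one tag.
        tagsByWord(tagged).forEach((word, tags) -> {
            if (tags.size() > 1) {
                System.out.println(word + " -> " + tags);
            }
        });
    }
}
```

For this sentence, only "bill" and "force" end up with more than one tag, which is why a plain dictionary lookup cannot decide their POS without context.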
The use of textese, a combination of different forms of text including abbreviations,
hashtags, emoticons, and slang, in communication media such as tweets and text messages
makes sentences more difficult to tag. For example, the following message is difficult to tag:
"AFAIK she H8 cth! BTW had a GR8 tym at the party BBIAM."
Its equivalent is:
"As far as I know, she hates cleaning the house! By the way, had a great time at the party.
Be back in a minute."
Using the OpenNLP tagger, we will get the following output:
AFAIK/NNS she/PRP H8/CD cth!/.
BTW/NNP had/VBD a/DT GR8/CD tym/NN at/IN the/DT party/NN
BBIAM./.
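One possible workaround, which is our own assumption rather than something the tagger does, is to expand known abbreviations before tagging. The following minimal sketch uses a hypothetical class, TexteseExpander, whose lookup table is taken only from the example message and its equivalent above. A real application would need a much larger table and a proper tokenizer.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical preprocessing helper for illustration only.
class TexteseExpander {

    // Lookup table built from the example message and its expansion.
    private static final Map<String, String> EXPANSIONS = new HashMap<>();
    static {
        EXPANSIONS.put("AFAIK", "as far as I know");
        EXPANSIONS.put("H8", "hates");
        EXPANSIONS.put("CTH", "cleaning the house");
        EXPANSIONS.put("BTW", "by the way");
        EXPANSIONS.put("GR8", "great");
        EXPANSIONS.put("TYM", "time");
        EXPANSIONS.put("BBIAM", "be back in a minute");
    }

    // Replaces each known abbreviation and keeps unknown tokens unchanged.
    static String expand(String message) {
        StringJoiner out = new StringJoiner(" ");
        for (String token : message.trim().split("\\s+")) {
            // Split off trailing punctuation so that "cth!" matches the key "CTH".
            int end = token.length();
            while (end > 0 && ".!?,".indexOf(token.charAt(end - 1)) >= 0) {
                end--;
            }
            String core = token.substring(0, end);
            String punctuation = token.substring(end);
            String replacement = EXPANSIONS.get(core.toUpperCase(Locale.ROOT));
            out.add((replacement != null ? replacement : core) + punctuation);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("AFAIK she H8 cth! BTW had a GR8 tym at the party BBIAM."));
    }
}
```

The expanded text can then be passed to a tagger in the usual way. The sketch does not restore capitalization or sentence boundaries, so the result is only an approximation of the equivalent message shown earlier.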
In the "Using the MaxentTagger class to tag textese" section later in this chapter, we will
demonstrate how the Stanford MaxentTagger class can handle textese. A short list of textese
is given in the following table:
Phrase              Textese    Phrase        Textese
As far as I know    AFAIK      By the way    BTW