tor. The sentence detector uses a predefined model trained on Tiger corpus data [6] for the German language, together with a self-developed heuristic: we created a set of likely sentence endings ('.', '?' and '!') and unlikely ones (abbreviations such as Fr., Hr., Prof. …) and check the output of the OpenNLP sentence detector against it. If a sentence chunk has an ending that should never end a sentence, we merge the incorrectly split sentence parts back together.
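The merge step can be sketched as follows; this is a minimal illustration of the heuristic, not the authors' actual implementation, and the abbreviation list and function name are our own:

```python
# Abbreviations after which a sentence should never be split (illustrative subset).
UNLIKELY_ENDINGS = ("Fr.", "Hr.", "Prof.", "Dr.")

def merge_false_splits(chunks):
    """Re-join sentence chunks that were incorrectly split after an abbreviation."""
    merged = []
    for chunk in chunks:
        if merged and merged[-1].endswith(UNLIKELY_ENDINGS):
            # The previous chunk ends in an abbreviation: undo the split.
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)
    return merged
```

For example, the detector output `["Prof.", "Müller sagte das."]` would be merged back into the single sentence `"Prof. Müller sagte das."`.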
Lemmatization. Determining the lemma of a word is a necessary step for our lexicon-based reporting verb finder described in Sect. 1.3.3.2. In German, the lemma of a noun is normally the nominative singular form, and that of a verb the present active infinitive. In our quotation extraction pipeline we use a morphological lemmatizer that looks up the words of the news article in a lexicon. The lexicon11 was generated with “Morphy”,12 a software tool for the morphological analysis of German text [25].
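A lexicon-based lemmatizer of this kind reduces to a table lookup over a Morphy-style full-form lexicon that maps inflected forms to lemmas. The sketch below is hypothetical; the entries are invented for illustration and are not taken from the actual lexicon:

```python
# Tiny stand-in for a full-form lexicon: inflected form -> lemma.
LEXICON = {
    "sagte": "sagen",        # verb -> present active infinitive
    "Häuser": "Haus",        # noun -> nominative singular
    "Kanzlerin": "Kanzlerin",
}

def lemmatize(token):
    # Fall back to the surface form when the word is not in the lexicon.
    return LEXICON.get(token, token)
```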
Part-of-Speech Tagging. Part-of-speech tagging is the task of predicting the grammatical category (noun, verb, adjective, …) of a word based on the word's definition and its surrounding context. In computational linguistics, each word of a sentence is assigned a label from a predefined set of part-of-speech labels. For German, the “Stuttgart-Tübingen-Tagset” (STTS)13 is often used for labeling [44]. The proposed quotation extraction approach works with the Apache OpenNLP maximum entropy part-of-speech tagger. Together with the predefined model trained on the Tiger corpus, the tagger predicts STTS labels for the words of a given text.
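To illustrate the kind of output such a tagger produces, the invented example below pairs a short German sentence with STTS labels (ART = article, NN = noun, VVFIN = finite verb, PDS = substituting demonstrative pronoun, PTKVZ = separated verb particle, $. = sentence-final punctuation):

```python
# Hand-labeled example of STTS tags for a short German sentence.
sentence = ["Die", "Kanzlerin", "teilte", "das", "mit", "."]
stts_tags = ["ART", "NN", "VVFIN", "PDS", "PTKVZ", "$."]
tagged = list(zip(sentence, stts_tags))
```

Note how the separable verb “teilte … mit” is split into a finite verb (VVFIN) and a detached particle (PTKVZ); the chunking step described next reassembles such compounds.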
Noun and Verb Chunking. The chunking component analyzes sentences and determines verb and noun groups. These groups then serve as input for the recognition of potential reporting verbs or quotation holders. For example, a speaker may be referenced as “die deutsche Bundeskanzlerin” (the German chancellor), or a reporting verb may be a compound of two words such as “teilte mit” (informed). The phrases output by the chunker help to determine the correct boundaries. Our processing pipeline uses the Apache OpenNLP maximum-entropy-based chunker. To recognize noun chunks we use an out-of-the-box model distributed by Gunnar Aastrand Grimnes.14 For the recognition of verb chunks we trained a model on the Tiger corpus [6]. The Tiger corpus contains 50,000 German sentences taken from the “Frankfurter Rundschau”, which are POS-tagged and annotated with syntactic structure.
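Chunkers of this kind typically emit per-token IOB labels (B- begins a chunk, I- continues it, O is outside any chunk). The sketch below, our own simplification rather than the pipeline's code, shows how such labels are grouped into noun and verb phrases:

```python
def iob_to_chunks(tokens, iob_labels):
    """Group tokens with IOB chunk labels into (chunk_type, tokens) groups."""
    chunks, current = [], None
    for tok, label in zip(tokens, iob_labels):
        if label.startswith("B-"):          # a new chunk begins here
            current = (label[2:], [tok])
            chunks.append(current)
        elif label.startswith("I-") and current:
            current[1].append(tok)          # continue the open chunk
        else:                               # "O": token outside any chunk
            current = None
    return chunks
```

Applied to “die deutsche Bundeskanzlerin teilte mit”, this yields one noun chunk spanning the three-word speaker reference and one verb chunk covering the compound reporting verb.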
Named Entity Recognition. When quoting persons or organizations, a pronoun, a noun phrase, or the proper name of an entity can be used to reference the quotation speaker. State-of-the-art named entity recognizers mainly detect top-level entities such as persons, organizations, and locations, which can serve as a starting point for the detection of quotation holders. We integrated the Stanford named entity recognizer into our quotation extraction pipeline. The Stanford recognizer is implemented as a Conditional Random Field classifier [16]. We use a pre-trained model for German provided by [15] that labels tokens as person, organization, location, and miscellaneous.
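The recognizer's per-token output can be pictured as below; the sentence and its labels are an invented example in the four-class scheme just described, not actual pipeline output:

```python
# Illustrative NER labeling: one label per token, "O" marks non-entity tokens.
tokens = ["Angela", "Merkel", "sprach", "in", "Berlin", "."]
labels = ["PERSON", "PERSON", "O", "O", "LOCATION", "O"]

# Keep only the labeled entity tokens as candidate quotation holders.
entities = [(t, l) for t, l in zip(tokens, labels) if l != "O"]
```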
11 http://www.danielnaber.de/morphologie/.
12 http://www.wolfganglezius.de/doku.php?id=cl:morphy.
13 http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/TagSets/stts-table.html.
14 http://gromgull.net/blog/category/machine-learning/nlp/.