tor. The sentence detector uses a predefined model trained on Tiger corpus data [6] for the German language, together with a self-developed heuristic: we created a set of likely sentence endings ('.', '?' and '!') and unlikely ones (abbreviations such as Fr., Hr., Prof. …) and check the output of the OpenNLP sentence detector against it. If a sentence chunk has an ending that should never end a sentence, we merge the incorrectly split sentence parts back together.
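The merge step can be sketched as follows; this is a minimal illustration of the heuristic, not the authors' actual implementation, and the abbreviation list and function name are our own:

```python
# Abbreviations after which a sentence should never be split (illustrative subset).
UNLIKELY_ENDINGS = ("Fr.", "Hr.", "Prof.", "Dr.")

def merge_false_splits(chunks):
    """Re-join sentence chunks that were incorrectly split after an abbreviation."""
    merged = []
    for chunk in chunks:
        if merged and merged[-1].endswith(UNLIKELY_ENDINGS):
            # The previous chunk ends in an abbreviation: undo the split.
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)
    return merged
```

For example, the detector output `["Prof.", "Müller sagte das."]` would be merged back into the single sentence `"Prof. Müller sagte das."`.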
Lemmatization. Determining the lemma of a word is a necessary step for our lexicon-based reporting verb finder described in Sect. 1.3.3.2. In German, the lemma of a noun is normally the nominative singular form, and that of a verb the present active infinitive. In our quotation extraction pipeline we use a morphological lemmatizer that looks up the words of the news article in a lexicon. The lexicon11 was generated with “Morphy”,12 a software tool for the morphological analysis of German text [25].
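A lexicon-based lemmatizer of this kind reduces to a table lookup over a Morphy-style full-form lexicon that maps inflected forms to lemmas. The sketch below is hypothetical; the entries are invented for illustration and are not taken from the actual lexicon:

```python
# Tiny stand-in for a full-form lexicon: inflected form -> lemma.
LEXICON = {
    "sagte": "sagen",        # verb -> present active infinitive
    "Häuser": "Haus",        # noun -> nominative singular
    "Kanzlerin": "Kanzlerin",
}

def lemmatize(token):
    # Fall back to the surface form when the word is not in the lexicon.
    return LEXICON.get(token, token)
```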
Part-of-Speech Tagging. Part-of-speech tagging is the task of predicting the grammatical category (noun, verb, adjective, …) of a word based on the word's definition and its surrounding context. In computational linguistics, each word of a sentence is assigned a label from a predefined set of part-of-speech labels. For German, the “Stuttgart-Tübingen-Tagset” (STTS)13 is often used for labeling [44]. The proposed quotation extraction approach works with the Apache OpenNLP maximum entropy part-of-speech tagger. Together with the predefined model trained on the Tiger corpus, the tagger predicts STTS labels for the words of a given text.
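To illustrate the kind of output such a tagger produces, the invented example below pairs a short German sentence with STTS labels (ART = article, NN = noun, VVFIN = finite verb, PDS = substituting demonstrative pronoun, PTKVZ = separated verb particle, $. = sentence-final punctuation):

```python
# Hand-labeled example of STTS tags for a short German sentence.
sentence = ["Die", "Kanzlerin", "teilte", "das", "mit", "."]
stts_tags = ["ART", "NN", "VVFIN", "PDS", "PTKVZ", "$."]
tagged = list(zip(sentence, stts_tags))
```

Note how the separable verb “teilte … mit” is split into a finite verb (VVFIN) and a detached particle (PTKVZ); the chunking step described next reassembles such compounds.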
Noun and Verb Chunking. The chunking component analyzes sentences and determines verb and noun groups. These groups then serve as input for the recognition of potential reporting verbs or quotation holders. For example, a speaker may be referenced as “die deutsche Bundeskanzlerin” (the German chancellor), or a reporting verb may be a compound of two words such as “teilte mit” (informed). The phrases output by the chunker help to determine the correct boundaries. Our processing pipeline uses the Apache OpenNLP maximum-entropy-based chunker. To recognize noun chunks we use an out-of-the-box model distributed by Gunnar Aastrand Grimnes.14 For the recognition of verb chunks we trained a model on the Tiger corpus [6]. The Tiger corpus contains 50,000 German sentences taken from the “Frankfurter Rundschau”, which are POS-tagged and annotated with syntactic structure.
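Chunkers of this kind typically emit per-token IOB labels (B- begins a chunk, I- continues it, O is outside any chunk). The sketch below, our own simplification rather than the pipeline's code, shows how such labels are grouped into noun and verb phrases:

```python
def iob_to_chunks(tokens, iob_labels):
    """Group tokens with IOB chunk labels into (chunk_type, tokens) groups."""
    chunks, current = [], None
    for tok, label in zip(tokens, iob_labels):
        if label.startswith("B-"):          # a new chunk begins here
            current = (label[2:], [tok])
            chunks.append(current)
        elif label.startswith("I-") and current:
            current[1].append(tok)          # continue the open chunk
        else:                               # "O": token outside any chunk
            current = None
    return chunks
```

Applied to “die deutsche Bundeskanzlerin teilte mit”, this yields one noun chunk spanning the three-word speaker reference and one verb chunk covering the compound reporting verb.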
Named Entity Recognition. When quoting persons or organizations, a pronoun, a noun phrase, or the proper name of an entity can be used to reference the quotation speaker. State-of-the-art named entity recognizers mainly detect top-level entities such as persons, organizations, and locations, which can serve as a starting point for the detection of quotation holders. We integrated the Stanford named entity recognizer into our quotation extraction pipeline. The Stanford recognizer is implemented as a Conditional Random Field classifier [16]. We use a pre-trained model for German provided by [15] that labels tokens as person, organization, location, and miscellaneous.
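The recognizer's per-token output can be pictured as below; the sentence and its labels are an invented example in the four-class scheme just described, not actual pipeline output:

```python
# Illustrative NER labeling: one label per token, "O" marks non-entity tokens.
tokens = ["Angela", "Merkel", "sprach", "in", "Berlin", "."]
labels = ["PERSON", "PERSON", "O", "O", "LOCATION", "O"]

# Keep only the labeled entity tokens as candidate quotation holders.
entities = [(t, l) for t, l in zip(tokens, labels) if l != "O"]
```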
11 http://www.danielnaber.de/morphologie/.
12 http://www.wolfganglezius.de/doku.php?id=cl:morphy.
13 http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/TagSets/stts-table.html.
14 http://gromgull.net/blog/category/machine-learning/nlp/.