Intelligent News Aggregator for German with Sentiment Analysis - Smart Information Systems: Computational Intelligence for Real-Life Applications

k-nearest neighbor (k-NN) classifier working with 20 feature types that predicts

whether an identified verb group introduces a quotation or not. The authors conclude

that the token-based approach using a CRF classifier outperforms the rule-based

baseline as well as the constituent-based approach using the EM classifier for all

quotation types. Regarding quotation attribution, the authors transfer four methods

described in O'Keefe et al. [35] for direct quotation and find that they are all suitable

for indirect and mixed quotations.

There is little scientific work that aims at extracting quotations exclusively from

German news articles. To the best of our knowledge, only Pouliquen et al. and Akbik

and Schenck [1] deal with German-language news articles. While Pouliquen et al.

include German into their multilingual system as one of many languages, Akbik and

Schenck present a system that automatically collects news from the main German

news sites and then extracts direct quotations from these news articles. Their approach

detects text between quotation marks as quote candidates and uses a named entity

recognizer to identify potential speakers. Then a set of heuristics is used to determine

the resulting quote-speaker tuples.
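
As a rough illustration of the candidate step described above, text between quotation marks can be collected with a regular expression. This is only a sketch and not Akbik and Schenck's implementation: the function name `find_quote_candidates` and the set of quotation-mark pairs are assumptions made here, and speaker detection with a named entity recognizer is left out.

```python
import re

# Assumed quotation-mark pairs: German low-high („…“), guillemets (»…«),
# and straight double quotes ("…"). Real news text may use others.
QUOTE_PATTERN = re.compile(r'„([^“”"]+)[“”"]|»([^«]+)«|"([^"]+)"')


def find_quote_candidates(text):
    """Return (start, end, quoted_text) for each span enclosed in quotation marks."""
    candidates = []
    for match in QUOTE_PATTERN.finditer(text):
        # Exactly one of the three groups is set, depending on the mark style.
        quoted = next(g for g in match.groups() if g is not None)
        candidates.append((match.start(), match.end(), quoted))
    return candidates
```

The character offsets are kept so that later steps could look for speaker candidates near each quoted span.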

1.3.3 Approach

The proposed approach for extracting direct and reported speech from German news

articles is rule-based. For each quotation, the system identifies a speaker, a reporting

verb or a preparative phrase (such as “…, so Angela Merkel”), and the quotation

text with all its parts. We divide the task into five subtasks and model our quotation

extraction approach as a processing pipeline where the news articles are annotated

in each step of the pipeline with further information. Figure 1.3 shows the

components involved and the workflow. Starting with a document preprocessing

component, we perform linguistic analyses such as part-of-speech tagging and lemmatization

that serve as a basis for further processing steps. The normalization of quotation

marks is important at this point as well. Detecting a reporting verb helps to identify

the reporting clauses and is also a strong indicator of indirect speech. We therefore

search for reporting verbs in the next step of our pipeline. Subsequently, we identify the

reporting clauses of direct and indirect quotations and determine the quotation parts

and exact boundaries of the entire quotation. Note that the boundaries of indirect or

mixed quotations may be ambiguous and in many cases difficult to recognize even

by humans. In the last step of our pipeline, we attribute one or more quotation holders

to the previously identified quotations.
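
The pipeline idea can be sketched as a chain of functions that each add annotations to a shared document object. This is a minimal sketch under stated assumptions, not the system's implementation: the `Document` class, the step functions, and the tiny `REPORTING_VERBS` list are hypothetical. It matches inflected word forms instead of lemmas, and the linguistic analysis, boundary detection, and attribution steps are omitted.

```python
from dataclasses import dataclass, field

# Hypothetical, tiny list of inflected German reporting verbs (illustration only).
REPORTING_VERBS = {"sagte", "sagt", "erklärte", "betonte", "meinte"}

# Map several typographic quotation marks to a plain ASCII double quote.
QUOTE_MARKS = str.maketrans({"„": '"', "“": '"', "”": '"', "»": '"', "«": '"'})


@dataclass
class Document:
    text: str
    annotations: dict = field(default_factory=dict)


def normalize_quotation_marks(doc):
    """Preprocessing step: unify quotation marks so later steps see one style."""
    doc.text = doc.text.translate(QUOTE_MARKS)
    return doc


def find_reporting_verbs(doc):
    """Annotate (token index, word) pairs for tokens found in the verb list."""
    hits = []
    for i, token in enumerate(doc.text.split()):
        word = token.strip('.,:;"!?').lower()
        if word in REPORTING_VERBS:
            hits.append((i, word))
    doc.annotations["reporting_verbs"] = hits
    return doc


def run_pipeline(doc, steps):
    """Apply each step in order; every step enriches the same document."""
    for step in steps:
        doc = step(doc)
    return doc


doc = run_pipeline(
    Document("Merkel sagte: „Wir schaffen das.“"),
    [normalize_quotation_marks, find_reporting_verbs],
)
print(doc.text)
print(doc.annotations["reporting_verbs"])
```

Keeping each subtask as a separate function makes it easy to add further steps to the list without changing the earlier ones.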

1.3.3.1 Document Preprocessing

Quotation mark normalization. News articles may contain malformed markup.

Especially in systems with automatic news harvesting from heterogeneous sources,

the collected texts may be erroneous, e.g., in terms of incomplete articles, misplaced
