Intelligent News Aggregator for German with Sentiment Analysis - Smart Information Systems: Computational Intelligence for Real-Life Applications


De La Clergerie et al. [11] present an approach to quotation extraction from French news articles. Their rule-based approach includes a comprehensive linguistic processing chain with a deep parser. A postprocessing component constructs direct and mixed quotations based on parsing results, 230 quotation verbs, and direct speech parts signaled by quotation marks that were retrieved in previous processing steps.

As with [11], the authors of [52] focus their quotation extraction approach on French news articles. Again, they pursue a rule-based approach, which exploits an automatically created lexicon of reporting verbs. The authors identify 16 patterns matching indirect quotations and implement them as an unlexicalized grammar using a finite-state machine.

Besides the rule-based systems [1, 10, 11, 24, 26, 41, 45, 52], a range of supervised approaches has been presented for the task of quotation extraction [35, 38, 39]. Fernandes et al. [39] propose a supervised solution using an Entropy Guided Transformation Learning (ETL) algorithm. They automatically generate rules instead of manually designing them. The work regards quotation extraction as a two-task problem: first, the system identifies quotations; second, it associates each quotation with a speaker. Recognized named entities and the output of a co-reference component serve as a basis for the speaker assignment. To solve the subtasks, different sets of features (named entities, terms, co-references, part-of-speech tags, etc.) are supplied to the ETL algorithm. The developed system is capable of extracting direct and mixed quotations from Portuguese news articles. To train their system, the authors create the GloboQuotes corpus.

The approach to quotation extraction from English texts proposed by O'Keefe et al. [35] makes use of supervised techniques as well. The authors solve the quotation extraction part with a regular expression that looks for text between quotation marks. For quote attribution, that is, finding the speaker of a quote, they cast the problem as a sequence labeling task. Inspired by Elson and McKeown [13], the authors encode news articles by replacing specific terms with symbols and by removing unnecessary information. Then, a set of features is calculated, which includes distance, paragraph, nearby, quote, and sequence features, again following Elson and McKeown [13]. To efficiently predict the target speaker from a list of candidate speakers, the authors compare different types of class models and sequence decoding. They also examine the effects of creating feature sets with and without gold-standard labels. They conduct their experiments on three different datasets and find that, when gold labels are left out of the feature calculation, performance drops significantly for classic literature but remains comparable for news articles.
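The extraction step described above, a regular expression that looks for text between quotation marks, can be illustrated with a minimal sketch. The pattern and the function name `extract_direct_quotes` are assumptions made for this illustration; they are not the expression used in [35], and the sketch covers only straight and English-style typographic double quotation marks.

```python
import re

# Hypothetical pattern (not taken from [35]): text enclosed either in
# straight double quotes or in English-style typographic double quotes.
QUOTE_RE = re.compile(r'"([^"]+)"|“([^”]+)”')


def extract_direct_quotes(text):
    """Return (start, end, content) for each quoted span found in text.

    start and end are character offsets of the span including the
    quotation marks; content is the text between the marks.
    """
    quotes = []
    for match in QUOTE_RE.finditer(text):
        # Exactly one of the two alternatives captured the content.
        content = match.group(1) if match.group(1) is not None else match.group(2)
        quotes.append((match.start(), match.end(), content))
    return quotes
```

Called on a sentence such as `The minister said, "We will act now."`, the function returns one tuple whose third element is `We will act now.`. Speaker attribution is a separate step and is not covered by this sketch.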

Pareti et al. [38] focus their work on the extraction of indirect and mixed quotations from English-language news articles. The authors explore two supervised algorithms, namely a Conditional Random Field (CRF) and a Maximum Entropy (ME) classifier. The token-based CRF classifier predicts IOB labels (I-inside, O-outside, or B-beginning), marking the beginning and the end of a quotation, whereas the ME classifier decides whether or not a phrase-structure parse node is a quotation. The classifiers largely rely on the same features but also incorporate classifier-dependent features. Instead of using a predefined list of reporting verbs, the authors train a
