1.3.2 Related Work
Numerous solutions for extracting quotations from newspaper material have been
proposed. The approaches differ in the technique they use, the languages they
support, whether they extract direct, mixed, or indirect quotations (or all three),
and how precisely they determine specific quotation units such as the
quotation holder and other circumstantial information.
The majority of previously published work detects reporting verbs in news articles
from a predefined or precompiled list and then extracts quotations based on rules
that are derived by experts. A careful prior analysis of the news material and
knowledge of the structure of quotations allow the identification of more or
less fine-grained patterns that may vary from language to language. Usually, the
patterns differ in the presence and position of lexical terms or syntactic information.
There is consensus that for each quote a speaker needs to be extracted, because the
information without the assignment of a speaker is of little use in most cases. Thus,
many researchers represent quotations as a triple consisting of the quoted text, the
quotation holder, and an optional reporting verb or quotation-introducing phrase.
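The triple representation and the verb-list approach described above can be illustrated with a minimal sketch. It is not taken from any of the cited systems: the verb list, the `Quotation` class, the name pattern, and the single rule are all hypothetical and far simpler than the expert-derived rule sets used in practice.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical, deliberately tiny list; real systems use precompiled lists.
REPORTING_VERBS = ["said", "stated", "announced", "claimed"]


@dataclass
class Quotation:
    """A quotation as a triple: quoted text, holder, optional reporting verb."""
    text: str
    holder: str
    verb: Optional[str] = None


_VERBS = "|".join(REPORTING_VERBS)
# Naive stand-in for person-name recognition: one or more capitalized words.
_NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# One illustrative rule: "quoted text", <verb> <Name>  or  "quoted text", <Name> <verb>
DIRECT_RULE = re.compile(
    rf'"(?P<text>[^"]+)"\s*,?\s*'
    rf"(?:(?P<verb1>{_VERBS})\s+(?P<name1>{_NAME})"
    rf"|(?P<name2>{_NAME})\s+(?P<verb2>{_VERBS}))"
)


def extract_direct(sentence: str) -> List[Quotation]:
    """Return all direct quotations in the sentence that match the rule."""
    results = []
    for m in DIRECT_RULE.finditer(sentence):
        holder = m.group("name1") or m.group("name2")
        verb = m.group("verb1") or m.group("verb2")
        # Drop a trailing comma that sits inside the quotation marks.
        text = m.group("text").strip().rstrip(",")
        results.append(Quotation(text=text, holder=holder, verb=verb))
    return results
```

Calling `extract_direct('"We will win," said Angela Merkel.')` yields one triple with the holder `Angela Merkel` and the verb `said`. A sentence such as `'"We will win," he said.'` yields nothing, because the rule expects a capitalized name next to the reporting verb.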
The rule-based system presented by Pouliquen et al. [ 41 ] extracts around 2,600
direct quotations per day from a multilingual news stream. In order to keep the system
extensible to other languages, the approach does not rely on linguistic information
but on lexical patterns. The system recognizes quotation marks, reporting verbs, and
person names (along with further information such as temporal or spatial modifiers,
titles, and determiners) and applies three general and a few language-specific
rules to find quotations. A simple named entity disambiguation step helps to assign
quotation holders accurately. Still, the system misses quotations with
speakers referenced by pronouns, since it does not perform anaphora resolution.
Krestel et al. [ 24 ] assemble a set of six basic patterns to extract quotations from
news articles in English. They detect the most frequent reporting verbs using a
finite-state transducer and implement the identified patterns as a regular grammar.
Existing GATE components (see footnote 9) provide additional circumstantial
information required during the quotation extraction process. In contrast to
Pouliquen et al. [ 41 ], who limit their approach to direct quotations, the authors
treat indirect quotations as well.
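To make the difference between direct and indirect quotations concrete at the pattern level, the following sketch matches one indirect form, "Name verb that clause". It is an assumption-laden illustration only: it does not reproduce the six patterns of [ 24 ] or any GATE component, and the verb list, name pattern, and function name are hypothetical.

```python
import re
from typing import List, Tuple

# Hypothetical verb list and naive name pattern (capitalized word sequence).
REPORTING_VERBS = ["said", "stated", "announced", "claimed"]
_VERBS = "|".join(REPORTING_VERBS)
_NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Indirect form: <Name> <verb> that <clause>.
INDIRECT_RULE = re.compile(
    rf"(?P<holder>{_NAME})\s+(?P<verb>{_VERBS})\s+that\s+(?P<text>[^.]+)\."
)


def extract_indirect(sentence: str) -> List[Tuple[str, str, str]]:
    """Return (quoted text, holder, reporting verb) triples for indirect quotes."""
    return [
        (m.group("text").strip(), m.group("holder"), m.group("verb"))
        for m in INDIRECT_RULE.finditer(sentence)
    ]
```

For the sentence `Angela Merkel said that the talks would continue.` the function returns a single triple with the clause `the talks would continue`. No quotation marks are involved, so the rule has to rely entirely on the reporting verb and the subordinating "that". The naive name pattern also accepts any capitalized word, which is one reason real systems rely on proper named entity recognition.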
Most work on quotation extraction and attribution has addressed
English texts [ 24 , 26 , 35 , 38 ]. Still, several publications focus on languages
other than English. In particular, quotation extraction has been studied for
Portuguese [ 10 , 39 ] and French [ 11 , 52 ].
Sarmento and Nunes [ 10 ] present a system that handles Portuguese news articles.
It finds direct and indirect quotations by applying 19 patterns and by exploiting a
list of 35 reporting verbs. The system does not implement anaphora resolution for
pronouns or noun phrases and therefore detects only speakers referenced by their
proper name. The authors evaluated their approach manually on 570 quotations
extracted by the system.
9 https://gate.ac.uk/