De La Clergerie et al. [ 11 ] present an approach to quotation extraction from
French news articles. Their rule-based approach includes a comprehensive linguistic
processing chain with a deep parser. A postprocessing component constructs direct
and mixed quotations based on parsing results, 230 quotation verbs, and direct speech
parts signaled by quotation marks that were retrieved in previous processing steps.
As with [ 11 ], the authors in [ 52 ] focus their quotation extraction approach on French
news articles. Again, they pursue a rule-based approach that exploits an automatically
created lexicon of reporting verbs. The authors recognize 16 patterns matching indirect
quotations and implement them as an unlexicalized grammar using a finite-state
machine.
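To make the idea of a pattern implemented as a finite-state machine concrete, the following is a minimal sketch of a single, hypothetical pattern of the form "speaker, reporting verb, complementizer, content". It is not one of the 16 patterns from [ 52 ]; the English verb list, the category names, and the function `match_indirect_quotation` are assumptions made for illustration, whereas the cited work targets French and builds its lexicon automatically. The transition table only refers to token categories, so the lexicon can be exchanged without touching the pattern.

```python
# Hypothetical reporting-verb lexicon (assumption, for illustration only).
REPORTING_VERBS = {"said", "claimed", "announced", "argued"}
COMPLEMENTIZERS = {"that"}


def categorize(token):
    """Map a token to a coarse category so the pattern stays word-independent."""
    lowered = token.lower()
    if lowered in REPORTING_VERBS:
        return "VERB"
    if lowered in COMPLEMENTIZERS:
        return "COMP"
    if token in {".", "!", "?"}:
        return "END"
    return "WORD"


# Transition table of the finite-state machine: (state, category) -> next state.
TRANSITIONS = {
    ("START", "WORD"): "SPEAKER",
    ("SPEAKER", "WORD"): "SPEAKER",
    ("SPEAKER", "VERB"): "VERB",
    ("VERB", "COMP"): "COMP",
    ("COMP", "WORD"): "CONTENT",
    ("CONTENT", "WORD"): "CONTENT",
    ("CONTENT", "VERB"): "CONTENT",
    ("CONTENT", "COMP"): "CONTENT",
    ("CONTENT", "END"): "ACCEPT",
}


def match_indirect_quotation(tokens):
    """Return speaker and content if the token sequence matches the pattern, else None."""
    state = "START"
    speaker, content = [], []
    for token in tokens:
        state = TRANSITIONS.get((state, categorize(token)))
        if state is None:
            return None  # no transition defined: the pattern does not match
        if state == "SPEAKER":
            speaker.append(token)
        elif state == "CONTENT":
            content.append(token)
        elif state == "ACCEPT":
            return {"speaker": " ".join(speaker), "content": " ".join(content)}
    return None  # input ended before reaching the accepting state


result = match_indirect_quotation("The minister said that taxes will fall .".split())
```

A real system would match over part-of-speech tags or parse chunks rather than raw words, and would run many such patterns side by side.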
Besides the rule-based systems [ 1 , 10 , 11 , 24 , 26 , 41 , 45 , 52 ], a range of
supervised approaches has been presented for the task of quotation extraction [ 35 , 38 ,
39 ]. Fernandes et al. [ 39 ] propose a supervised solution using an Entropy Guided
Transformation Learning (ETL) algorithm. They automatically generate rules instead
of manually designing them. The work treats quotation extraction as a two-task
problem: first, the system identifies quotations; second, it associates each quotation
with a speaker. Recognized named entities and the output of a co-reference
component serve as a basis for the speaker assignment. To solve the subtasks, different
sets of features (named entities, terms, co-references, part-of-speech tags, etc.) are
supplied to the ETL algorithm. The developed system is capable of extracting direct
and mixed quotations from Portuguese news articles. In order to train their system,
the authors create the GloboQuotes corpus.
The approach to quotation extraction from English texts proposed by O'Keefe
et al. [ 35 ] makes use of supervised techniques as well. The authors handle the
quotation extraction part with a regular expression that looks for text between
quotation marks. For quote attribution, that is, finding the speaker of a
quote, they cast the problem as a sequence labeling task. Inspired by Elson and
McKeown [ 13 ], the authors encode news articles by replacing specific terms with
symbols and by removing unnecessary information. Then, a set of features is cal-
culated which includes distance, paragraph, nearby, quote, and sequence features,
again following Elson and McKeown [ 13 ]. In order to efficiently predict the target
speaker from a list of candidate speakers, the authors compare different types of
class models and sequence decoding. They examine the effects of creating feature
sets with and without gold standard labels. They conduct their experiments on three
different datasets and find that, when gold labels are left out of feature calculation,
performance drops significantly for classic literature but remains comparable
for news articles.
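The extraction step described above, a regular expression that looks for text between quotation marks, can be sketched as follows. This is a minimal illustration and not the expression used in [ 35 ]; the pattern, the handling of straight and curly quotation marks, and the function name `extract_direct_quotes` are assumptions.

```python
import re

# Matches text between straight double quotes or between curly double quotes.
# Assumption: quotes are not nested and do not span unbalanced marks.
QUOTE_PATTERN = re.compile(r'"([^"]+)"|“([^”]+)”')


def extract_direct_quotes(text):
    """Return (start, end, content) for every quoted span found in the text."""
    quotes = []
    for match in QUOTE_PATTERN.finditer(text):
        # Exactly one of the two groups is filled, depending on the quote style.
        content = match.group(1) or match.group(2)
        quotes.append((match.start(), match.end(), content))
    return quotes


spans = extract_direct_quotes('He said "we will win" and she replied “not today”.')
```

The character offsets are kept because a later attribution step typically needs the position of each quote relative to candidate speakers, for example to compute distance features.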
Pareti et al. [ 38 ] focus their work on the extraction of indirect and mixed quotations
from English-language news articles. The authors explore two supervised algorithms,
namely a Conditional Random Field (CRF) classifier and a Maximum Entropy (ME)
classifier. The token-based CRF classifier predicts IOB labels (I: inside, O: outside,
B: beginning), marking the beginning and the end of a quotation, whereas the ME
classifier decides whether a phrase-structure parse node is or is not a quotation. The
classifiers largely rely on the same features but also incorporate classifier-dependent
features. Instead of using a predefined list of reporting verbs, the authors train a