1.3.3.5 Quotation Holder Extraction
The aim of quotation holder extraction is to attribute speakers to the identified
direct and indirect quotations. Our approach is based on the observation that quotation
holders are in most cases named entities, or references to named entities, that are
mentioned nearest to the reporting verb. For example, we choose the pronoun 'er' (he)
in the fragment ', sagte er dem Spiegel' (', he told Der Spiegel'). To determine
a quotation's holder we first create a set of candidates. As candidates we consider
named entities, pronouns (only “er” (he) and “sie” (she)) and noun chunks from the
reporting clause. We exclude candidates originating from the reported clause. Then,
we sort the list by proximity to the reporting verb but prioritize named entities and
pronouns over noun chunks. Pronouns and named entities keep their relative order, so
that passages like “, sagte er zu Angela Merkel” (he said to Angela Merkel) do
not get assigned to the wrong holder. If no reporting verb has been assigned to the
quotation we search for the word “so” in the reporting clause and sort the candidates
according to how near they are placed to the word “so”. Direct quotations may also
lack both a reporting verb and the word “so”, since they are detected with the aid of
quotation marks. In this case we simply select the
candidate nearest to the reported clause. Our approach to quotation holder extraction
also includes a simple form of co-reference resolution. If the quotation holder is a
person, we attempt to resolve the name to its longest form in the text.
If the assigned speaker is a pronoun, we choose the first named entity before the
quotation.
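The ranking rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Candidate` class, token-index positions, the function names, and the token-subset test used for name resolution are all hypothetical. It also assumes that the fallback case (no reporting verb and no “so”) ignores the priority of named entities and pronouns, which the text does not state explicitly.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    text: str
    kind: str      # "entity", "pronoun", or "chunk" (hypothetical labels)
    position: int  # token index within the reporting clause


def rank_candidates(candidates: List[Candidate], anchor: int) -> List[Candidate]:
    """Sort by distance to the anchor token. Named entities and pronouns
    share the top tier; noun chunks are ranked after them."""
    def key(c: Candidate):
        tier = 1 if c.kind == "chunk" else 0
        return (tier, abs(c.position - anchor))
    return sorted(candidates, key=key)


def select_holder(candidates: List[Candidate],
                  verb_pos: Optional[int] = None,
                  so_pos: Optional[int] = None,
                  quote_pos: int = 0) -> Optional[Candidate]:
    """Pick a holder: anchor on the reporting verb, else on the word "so",
    else take the candidate nearest to the reported clause."""
    if not candidates:
        return None
    if verb_pos is not None:
        return rank_candidates(candidates, verb_pos)[0]
    if so_pos is not None:
        return rank_candidates(candidates, so_pos)[0]
    # Assumption: the fallback uses plain proximity without tiers.
    return min(candidates, key=lambda c: abs(c.position - quote_pos))


def longest_name_form(name: str, mentions: List[str]) -> str:
    """Resolve a person name to the longest mention containing all its tokens.
    The matching rule is an assumption; the text only asks for the longest form."""
    tokens = set(name.split())
    matches = [m for m in mentions if tokens <= set(m.split())]
    return max(matches, key=len, default=name)


# ", sagte er zu Angela Merkel": tokens are ",", "sagte", "er", "zu", "Angela Merkel"
clause = [Candidate("er", "pronoun", 2), Candidate("Angela Merkel", "entity", 4)]
holder = select_holder(clause, verb_pos=1)
```

Because pronouns and named entities share one tier, the pronoun “er” wins here on proximity to “sagte”. Ranking named entities strictly above pronouns would instead pick “Angela Merkel”, which is the error the text warns about.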
1.3.4 Corpus
We manually annotated a corpus of 714 news articles containing direct and reported
speech. The news articles are all in German and were published over a time period
of three months from February 23, 2012 to May 21, 2012. The corpus supports the
evaluation of quotation boundary detection and of the recognition of reporting
verbs and quotation holders.
For the annotation process we had to ensure sufficient coverage of direct and
reported speech. We therefore preprocessed the news stream provided by Neofonie
GmbH and preselected news documents before starting the annotation
procedure. We automatically detected different types of quotation marks and a set of
predefined reporting verbs within the news articles. Thereafter we randomly sampled
1,000 news articles. We chose:
250 news articles containing at least one direct quotation (text passages identified
by the occurrence of quotation marks and that are longer than 24 characters)
250 news articles containing at least one of the following reporting verbs: “sagte”,
“berichtete”, “berichteten”, “gestand”, “erklärte”, “erklärten”
500 news articles without any restrictions.
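The preselection described above could be sketched as follows. This is a rough illustration, not the procedure actually used: the set of quotation marks, the regular expressions, the function names, and the decision to draw the three groups without overlap are all assumptions. Only the verb list, the group sizes, and the 24-character threshold come from the text.

```python
import random
import re

REPORTING_VERBS = ("sagte", "berichtete", "berichteten",
                   "gestand", "erklärte", "erklärten")

# Assumed quotation marks; the text only mentions "different types".
# A passage must be longer than 24 characters, i.e. at least 25.
QUOTE_RE = re.compile(r'[„“"»«]([^„“”"»«]{25,})[“”"«»]')
VERB_RE = re.compile(r"\b(?:" + "|".join(REPORTING_VERBS) + r")\b")


def has_direct_quotation(text: str) -> bool:
    return QUOTE_RE.search(text) is not None


def has_reporting_verb(text: str) -> bool:
    return VERB_RE.search(text) is not None


def sample_articles(articles, n_quote=250, n_verb=250, n_any=500, seed=0):
    """Return indices of sampled articles, one group per criterion.
    Assumption: an article is drawn at most once across the three groups."""
    rng = random.Random(seed)
    remaining = set(range(len(articles)))
    picked = []
    strata = [(has_direct_quotation, n_quote),
              (has_reporting_verb, n_verb),
              (lambda text: True, n_any)]
    for predicate, size in strata:
        pool = sorted(i for i in remaining if predicate(articles[i]))
        chosen = rng.sample(pool, min(size, len(pool)))
        picked.extend(chosen)
        remaining.difference_update(chosen)
    return picked
```

The quotation pattern is deliberately crude. With straight quotation marks it cannot tell opening from closing marks, so a production filter would need a more careful matcher.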