whether the quotation candidates contain a verb. Our investigations have shown that
quoted phrases with fewer than four words and without a verb are in most cases simply
highlighted text parts such as proper names. A direct quote may be composed of
several quotation candidates. That is why the component examines each quotation
candidate and decides whether it is the beginning of a new quotation or the part of
a compound quotation. It searches the environment of the quotation candidate for
incomplete sentences (sentences not ending with a period, exclamation mark, or question
mark) and reporting verbs. Incomplete preceding sentences are concatenated to the
quotation candidate. If the preceding sentence is complete, the component
checks whether it contains a reporting verb. Sentences with a reporting verb are
concatenated to the direct quotation candidate, because our experiments have shown
that these sentences are often reporting clauses that name the speaker of the quotation.
Sentences following a quotation candidate are processed in a similar way. If a
sentence contains a reporting verb or is incomplete, it is concatenated to the quotation
candidate. Subsequent sentence parts containing the German word “so” are also attached.
In this way we cover cases like “'…', so Angela Merkel”. Quotation candidates are
connected to each other if a quotation candidate directly follows a reporting clause
or another quotation candidate.
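
The candidate filter and the attachment rules described above can be sketched in a few
lines of Python. This is only an illustration under stated assumptions, not the
component itself: the verb lists are small hypothetical samples (the text does not
enumerate the reporting verbs), verb detection is reduced to a word-list lookup in
place of a real part-of-speech tagger, and all function names are invented for the
example.

```python
import re

# Hypothetical sample of German reporting verbs; the source does not list them.
REPORTING_VERBS = {"sagte", "erklärte", "betonte", "meinte"}
# Hypothetical stand-in for a part-of-speech tagger: a tiny list of known verbs.
KNOWN_VERBS = REPORTING_VERBS | {"ist", "sind", "hat", "haben", "wird", "kommt"}


def tokens(text):
    """Lower-cased word tokens of a text fragment."""
    return re.findall(r"\w+", text.lower())


def has_verb(text):
    return any(t in KNOWN_VERBS for t in tokens(text))


def is_quotation_candidate(quoted):
    """Reject quoted phrases with fewer than four words and no verb."""
    return len(tokens(quoted)) >= 4 or has_verb(quoted)


def is_complete(sentence):
    """A sentence is complete if it ends with a period, '!' or '?'."""
    return sentence.rstrip().endswith((".", "!", "?"))


def has_reporting_verb(sentence):
    return any(t in REPORTING_VERBS for t in tokens(sentence))


def should_attach(sentence):
    """Attach a neighbouring sentence if it is incomplete or has a reporting verb."""
    return (not is_complete(sentence)) or has_reporting_verb(sentence)


def extend_candidate(preceding, candidate, following):
    """Concatenate qualifying neighbours to a quotation candidate."""
    parts = []
    if preceding and should_attach(preceding):
        parts.append(preceding)
    parts.append(candidate)
    # Following parts are also attached when they contain the word "so".
    if following and (should_attach(following) or "so" in tokens(following)):
        parts.append(following)
    return " ".join(parts)
```

With these sample lists, a quoted proper name such as "Angela Merkel" is rejected as a
candidate, while a preceding fragment like "Die Kanzlerin sagte:" is attached because
it does not end with sentence-final punctuation.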
1.3.3.4 Indirect Quotation Extraction
Reported speech is not put in quotation marks. It is composed of a main (reporting)
and a subordinate (reported) clause. In German, the reported clause is often
introduced by the conjunction “dass” and uses the subjunctive mood for verbs. To
extract indirect quotations from a news article, we apply a rule-based approach. The
indirect quote extraction depends on the output of the direct quote collector.
Therefore, the indirect quote collection must follow the direct quote collection in our
processing pipeline. Our approach is to first identify a reporting and a reported clause
and then construct the final indirect quotation. The indirect quote collector exploits
the occurrence of reporting verbs. To avoid duplicate quotation extraction
(identifying the same quotation as both direct and indirect), the collector considers
only reporting verbs that have not already been assigned to a direct quotation. If a
detected reporting verb is not already part of a direct quotation, we assume that the
verb indicates the reporting clause of an indirect quotation. We build up an indirect
quotation by analyzing the surrounding sentences or sentence parts. A strong indicator
for the reported
clause is the presence of the conjunction “dass” (that) together with the finite verbs
“sei, seien, habe, werde, würde, würden” that are usually used in reported clauses to
repeat what someone has said. The occurrence of “dass” and one of the verbs implies
a reported clause and we infer a quotation. The quotation encompasses the reporting
and the reported clause. Sentences containing a reporting verb in the reporting clause
but missing “dass” in the reported clause are treated in the same way if they contain
the finite verbs mentioned above. We also detect indirect quotes indicated by ', so'
(as) and ', hieß es' (it was said).
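
The rules of this section can be illustrated with a minimal Python sketch. It is an
assumption-laden simplification rather than the collector described here: the
reporting verbs are a hypothetical sample, the function works on a single sentence
instead of analyzing surrounding sentences, and the parameter `claimed` (token
positions of reporting verbs already assigned to a direct quotation) is an invented
stand-in for the output of the direct quote collector. Only the finite verbs and the
two phrase markers are taken from the text.

```python
import re

# Hypothetical sample of reporting verbs; the source does not list them.
REPORTING_VERBS = {"sagte", "erklärte", "betonte"}
# Finite verbs named in the text as typical for reported clauses.
SUBJUNCTIVE_VERBS = {"sei", "seien", "habe", "werde", "würde", "würden"}
# Phrase markers named in the text.
PHRASE_MARKERS = (", so ", ", hieß es")


def tokens(text):
    """Lower-cased word tokens of a sentence."""
    return re.findall(r"\w+", text.lower())


def classify_indirect(sentence, claimed=frozenset()):
    """Return the name of the rule that fired, or None.

    `claimed` holds token indices of reporting verbs that already belong
    to a direct quotation and must therefore be ignored (assumption).
    """
    lowered = sentence.lower()
    if any(marker in lowered for marker in PHRASE_MARKERS):
        return "marker"
    toks = tokens(sentence)
    free_reporting_verb = any(
        t in REPORTING_VERBS and i not in claimed for i, t in enumerate(toks)
    )
    has_subjunctive = any(t in SUBJUNCTIVE_VERBS for t in toks)
    if free_reporting_verb and has_subjunctive:
        # "dass" strengthens the evidence but is not required.
        return "dass+subjunctive" if "dass" in toks else "subjunctive"
    return None
```

For example, `classify_indirect("Merkel sagte, dass die Lage ernst sei.")` fires the
“dass” rule, whereas passing `claimed={1}` for the same sentence suppresses the match
because the only reporting verb is treated as already used by a direct quotation.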