[Figure: pipeline components. Document Preprocessing (correction and normalization of quotation marks); Reporting Verb Detection (detection of verbs introducing direct and indirect quotes, like „sagte“ (said)); Direct Quotes Extraction (extraction of quoted speech employing pattern matching and rules); Indirect Quotes Extraction (rule-based extraction of reported speech); Quotation Holder Extraction (rule-based extraction of the quotation speaker); Quotation Postprocessing (calculating the confidence of the extracted quotation).]
Fig. 1.3 The components of the quotation extraction pipeline
meta information or missing quotation marks. Regarding quotation extraction, texts
released by different publishers may also contain varying styles of quotation marks,
such as “ ” or ' '. Since quotation marks are crucial indicators for both direct speech
and quotation boundaries, the correction and normalization of quotation marks is
an important subtask in quotation extraction systems. In our system a document
preprocessing component corrects errors arising from inconsistent quotation marks.
It first replaces all quotation marks with uniform quotation marks and then counts
the number of quotation marks. The component does not patch texts with an odd
number of quotation marks, but it does adjust quotations whose opening and closing
marks differ, e.g., quotations starting with “ and ending with '.
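The normalization step described above can be sketched in Python as follows. This is a minimal illustration, not the system's actual implementation: the function names, the set of quotation characters, the choice of `"` as the uniform mark, and the apostrophe heuristic are all assumptions made for the example.

```python
import re

# Assumed set of double-style quotation marks; the original system's list is not given.
DOUBLE_LIKE = "„“”‟«»"
UNIFORM = '"'  # assumed uniform quotation mark

_double_re = re.compile("[" + DOUBLE_LIKE + "]")
# Assumed heuristic: a single-style mark counts as a quotation mark unless it is
# flanked by word characters on both sides (as in "don't" or "geht's").
_single_re = re.compile(r"(?<!\w)[‚‘’']|[‚‘’'](?!\w)")


def normalize_quotation_marks(text):
    """Replace all recognized quotation marks with one uniform mark."""
    text = _double_re.sub(UNIFORM, text)
    return _single_re.sub(UNIFORM, text)


def preprocess(text):
    """Normalize the marks, then count them.

    Returns the normalized text and a flag that is False when the number of
    marks is odd. In that case no attempt is made to insert a missing mark.
    """
    normalized = normalize_quotation_marks(text)
    balanced = normalized.count(UNIFORM) % 2 == 0
    return normalized, balanced


# A quotation that starts with “ and ends with ' is adjusted by the uniform replacement.
example = "“Wir kämpfen noch immer', sagte Santorum."
print(preprocess(example))
```

Because every mark is mapped to the same character, a quotation with mismatched opening and closing marks is repaired as a side effect of normalization. One known limitation of the assumed heuristic: a word-final apostrophe (as in "Hans' Buch") would be treated as a quotation mark.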
Sentence Detection. Quotations may consist of several sentences or sentence
parts. For example, the quotation “Wir sind noch immer hier. Wir kämpfen noch
immer”, sagte Santorum. (“We are still here. We are still fighting”, Santorum said.)
is composed of two sentences. In such a case, the quotation extraction must recognize
both parts and determine the correct quotation boundaries. Sentence detection is
therefore an essential preprocessing step. Furthermore, other linguistic algorithms used in
news text analysis require sentences as a basis for their calculations. Our quotation
extraction pipeline uses the Apache OpenNLP¹⁰ Maximum Entropy sentence detec-
¹⁰ https://opennlp.apache.org/