k-nearest neighbor (k-NN) classifier working with 20 feature types that predicts
whether an identified verb group introduces a quotation or not. The authors conclude
that the token-based approach using a CRF classifier outperforms the rule-based
baseline as well as the constituent-based approach using the EM classifier for all
quotation types. Regarding quotation attribution, the authors transfer four methods
described in O'Keefe et al. [35] for direct quotations and find that all four are suitable
for indirect and mixed quotations.
There is little scientific work that aims at extracting quotations exclusively from
German news articles. To the best of our knowledge, only Pouliquen et al. and Akbik
and Schenck [1] deal with German-language news articles. While Pouliquen et al.
include German into their multilingual system as one of many languages, Akbik and
Schenck present a system that automatically collects news from the main German
news sites and then extracts direct quotations from these news articles. Their approach
detects text between quotation marks as quote candidates and uses a named entity
recognizer to identify potential speakers. Then a set of heuristics is used to determine
the resulting quote-speaker tuples.
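The first step of such an approach, collecting text between quotation marks as quote candidates, can be sketched as follows. This is a minimal illustration and not the cited system: the list of quotation mark pairs and the function name `quote_candidates` are assumptions, and the named entity recognition and speaker heuristics are omitted.

```python
import re

# Pairs of opening/closing quotation marks that may occur in German news text.
# This list is an assumption for illustration, not taken from the cited system.
QUOTE_PAIRS = [("„", "“"), ("»", "«"), ('"', '"')]


def quote_candidates(text):
    """Return (start, end, content) for each span enclosed in a known pair."""
    candidates = []
    for open_q, close_q in QUOTE_PAIRS:
        # Non-greedy match so that each opening mark pairs with the nearest close.
        pattern = re.escape(open_q) + r"(.+?)" + re.escape(close_q)
        for match in re.finditer(pattern, text, flags=re.DOTALL):
            candidates.append((match.start(1), match.end(1), match.group(1)))
    # Sort by position so candidates appear in document order.
    return sorted(candidates)


if __name__ == "__main__":
    sample = "„Wir schaffen das“, sagte Angela Merkel."
    for start, end, content in quote_candidates(sample):
        print(start, end, content)
```

The character offsets are kept alongside the content so that a later step could look for speaker mentions near each candidate.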
1.3.3 Approach
The proposed approach for extracting direct and reported speech from German news
articles is rule-based. For each quotation, the system identifies a speaker, a
reporting verb or a preparative phrase (like “…, so Angela Merkel”), and the quotation
text with all its parts. We divide the task into five subtasks and model our quotation
extraction approach as a processing pipeline in which the news articles are annotated
with further information at each step. Figure 1.3 shows the components and the
workflow. Starting with a document preprocessing component, we perform linguistic
analyses such as part-of-speech tagging and lemmatization, which serve as a basis for
the further processing steps. The normalization of quotation marks is important at
this point as well. Detecting a reporting verb helps to identify the reporting clauses
and is also a strong indicator of indirect speech. We therefore search for reporting
verbs in the next step of our pipeline. Subsequently, we identify the
reporting clauses of direct and indirect quotations and determine the quotation parts
and exact boundaries of the entire quotation. Note that the boundaries of indirect or
mixed quotations may be ambiguous and in many cases difficult to recognize even
by humans. In the last step of our pipeline we attribute one or more quotation holders
to the previously identified quotations.
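The pipeline idea, in which every step adds annotations to the same document, can be sketched roughly as follows. All names here (`Document`, `normalize_quotes`, `find_reporting_verbs`, `run_pipeline`) are hypothetical, the verb lexicon is a toy assumption that matches surface forms rather than lemmas, and mapping all quotation marks to the ASCII double quote is only one possible normalization. Only two of the steps are shown.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    text: str
    # Each pipeline step stores its results under its own key.
    annotations: dict = field(default_factory=dict)


def normalize_quotes(doc):
    # Map a few typographic quotation marks to the ASCII double quote
    # (assumed normalization target, chosen for illustration only).
    for mark in "„“”»«":
        doc.text = doc.text.replace(mark, '"')
    return doc


# Toy lexicon of inflected reporting verbs (assumption, not from the source).
REPORTING_VERBS = {"sagte", "erklärte", "betonte"}


def find_reporting_verbs(doc):
    # Crude whitespace tokenization; a real system would rely on the
    # part-of-speech tags and lemmas produced during preprocessing.
    tokens = doc.text.replace(",", " ").replace(".", " ").split()
    doc.annotations["reporting_verbs"] = [
        token for token in tokens if token.lower() in REPORTING_VERBS
    ]
    return doc


def run_pipeline(doc, steps):
    # Apply the steps in order; each one receives the annotated document.
    for step in steps:
        doc = step(doc)
    return doc


if __name__ == "__main__":
    doc = run_pipeline(
        Document("„Wir schaffen das“, sagte Angela Merkel."),
        [normalize_quotes, find_reporting_verbs],
    )
    print(doc.text)
    print(doc.annotations)
```

Keeping each step as a function over a shared document object makes it easy to append the remaining steps, such as boundary detection and speaker attribution, without changing the earlier ones.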
1.3.3.1 Document Preprocessing
Quotation marks normalization. News articles may contain malformed markup.
Especially in systems with automatic news harvesting from heterogeneous sources,
the collected texts may be erroneous, e.g., in terms of incomplete articles, misplaced