Table 1.1 The list of reporting verbs used for the quotation extraction approach
German reporting verbs
Sagen
Behaupten
Aussprechen
Abraten
Teilen
Meinen
Warnen
Erwähnen
Raten
Klären
Fragen
Betonen
Bejahen
Ausfragen
Aufklären
Denken
Loben
Ermahnen
Ausplaudern
Mitteilen
Äußern
Zugeben
Beichten
Erklären
Begründen
laneous. Since the Stanford classifier sometimes misses named entities, we
decided to augment the list of named entities it returns with the named entities
identified by the part-of-speech tagger described above. Entities recognized in
this way are tagged as UNKNOWN, since the tagger marks entities without
providing a type.
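The augmentation step can be illustrated with a minimal sketch. It assumes that both components deliver plain strings and that duplicates are matched by surface text; the function name `merge_entities` and the tuple layout are hypothetical and not taken from the system described here.

```python
from typing import Iterable, List, Tuple

# (surface text, entity type); an assumed, simplified representation
Entity = Tuple[str, str]


def merge_entities(classifier_entities: List[Entity],
                   tagger_names: Iterable[str]) -> List[Entity]:
    """Add tagger-only names to the classifier output with type UNKNOWN."""
    merged = list(classifier_entities)
    # Assumption: two mentions are the same entity if their text is identical.
    seen = {text for text, _ in classifier_entities}
    for name in tagger_names:
        if name not in seen:
            merged.append((name, "UNKNOWN"))
            seen.add(name)
    return merged
```

Entities found by the classifier keep their type, so a name reported by both components is never downgraded to UNKNOWN.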
1.3.3.2 Reporting Verb Detection
The detection of reporting verbs, that is, verbs that introduce quotations, is
especially important for recognizing reported speech and the quotation holder.
Our reporting verb detection approach is lexicon-based. We manually assembled a
list of 25 common reporting verbs. We started with a set of six seed reporting
verbs and extended the set by adding synonyms from Wortschatz Leipzig.¹⁵
Wortschatz Leipzig also outputs a frequency class that relates a word's
frequency to that of the most frequent word in the corpus. We pruned the list by
removing rare words (high frequency class) and very ambiguous words. Table 1.1
gives an overview of the common German reporting verbs that the reporting verb
detector uses in our quotation extraction approach. When analyzing a text, the
reporting verb detector checks whether each word's lemma occurs in the list.
The matching words are then treated as reporting verb candidates for the
quotations to extract.
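A minimal sketch of the lexicon lookup and the frequency-based pruning is given below. It assumes that lemmas come from an external lemmatizer, that matching is case-insensitive, and that the frequency class N means the most frequent word is roughly 2^N times more frequent than the given word. The function names and the `max_class` threshold are hypothetical; the text does not state the cut-off that was used.

```python
import math
from typing import Dict, Iterable, List, Tuple

# The 25 verbs of Table 1.1, lowercased for lookup.
REPORTING_VERBS = frozenset({
    "sagen", "behaupten", "aussprechen", "abraten", "teilen", "meinen",
    "warnen", "erwähnen", "raten", "klären", "fragen", "betonen", "bejahen",
    "ausfragen", "aufklären", "denken", "loben", "ermahnen", "ausplaudern",
    "mitteilen", "äußern", "zugeben", "beichten", "erklären", "begründen",
})


def frequency_class(word_count: int, top_count: int) -> int:
    """Assumed definition: the top word is about 2**N times more frequent."""
    return int(math.floor(math.log2(top_count / word_count) + 0.5))


def prune_rare(candidates: Iterable[str], counts: Dict[str, int],
               top_count: int, max_class: int) -> List[str]:
    """Keep candidates whose frequency class does not exceed max_class."""
    return [w for w in candidates
            if w in counts and frequency_class(counts[w], top_count) <= max_class]


def find_reporting_verbs(lemmas: List[str]) -> List[Tuple[int, str]]:
    """Return (token index, lemma) for every lemma found in the lexicon."""
    return [(i, lemma) for i, lemma in enumerate(lemmas)
            if lemma.lower() in REPORTING_VERBS]
```

Removing very ambiguous words is a manual judgment in the text and is therefore not modeled here.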
1.3.3.3 Direct Quotation Extraction
All quotations within quotation marks are regarded as direct quotes (quoted speech).
The direct quote collector detects quotations using pattern recognition and
handcrafted rules. It first compiles a set of quotation candidates (text parts
enclosed by quotation marks) and then applies the handcrafted rules to them to
construct the final direct quote.
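The two-stage procedure might look like the following sketch. The set of quotation mark characters and the regular expression are assumptions chosen for illustration, and the only rule applied is the four-word minimum described in the next paragraph; the actual rule set is not reproduced here.

```python
import re
from typing import List

# Assumed sets of opening and closing marks: „ “ » « " and “ ” « » "
OPEN_MARKS = "\u201e\u201c\u00bb\u00ab\""
CLOSE_MARKS = "\u201c\u201d\u00ab\u00bb\""
ALL_MARKS = OPEN_MARKS + CLOSE_MARKS

# An opening mark, at least one non-mark character, then a closing mark.
QUOTE_PATTERN = re.compile(f"[{OPEN_MARKS}]([^{ALL_MARKS}]+)[{CLOSE_MARKS}]")


def collect_candidates(text: str) -> List[str]:
    """Stage 1: every text part enclosed by quotation marks."""
    return [m.group(1).strip() for m in QUOTE_PATTERN.finditer(text)]


def direct_quotes(text: str, min_words: int = 4) -> List[str]:
    """Stage 2: drop candidates that are too short to be quoted speech."""
    return [c for c in collect_candidates(text) if len(c.split()) >= min_words]
```

For a sentence such as `Er sagte: „Wir werden das Problem bald lösen“, und nannte es „dringend“.`, stage 1 yields both enclosed spans and stage 2 keeps only the first, because the single emphasized word falls below the threshold.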
The applied pattern is composed of different combinations of left and right
quotation marks, which must enclose at least one character. To avoid detecting
single words or phrases that are merely emphasized with quotation marks,
quotation candidates are discarded if they consist of fewer than four words.
Furthermore, we check
15 http://wortschatz.uni-leipzig.de/
 