ted by natural language (Chaps. 10, 11). In DBS, the computation of these
perspectives is based on (i) suitable inferences and (ii) the values of the
agent's STAR parameters, for Space, Time, Agent, Recipient.
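A minimal sketch of what such a STAR record might look like, assuming simple string-valued fields (the class name and field types are illustrative, not part of the DBS definition):

    from dataclasses import dataclass

    @dataclass
    class STAR:
        space: str      # where the utterance is anchored
        time: str       # when the utterance is anchored
        agent: str      # who produces or interprets the content
        recipient: str  # to whom the content is addressed

    # Example anchoring of a content for perspective computation:
    star = STAR(space="kitchen", time="2010-06-01T09:00", agent="speaker", recipient="hearer")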
The traditions from Structuralism to Nativism in linguistics and from Symbolic Logic to Truth-Conditional Semantics in analytic philosophy have not gotten around to addressing these points, let alone coming up with their own cogent model of natural language communication suitable for a talking robot. It is therefore little wonder that engineers and researchers trying to satisfy an ever-growing demand for applications in human-machine communication are turning in droves to other approaches. Having long given up on expecting efficient algorithms and sensible models[31] from linguistics and philosophy, the engineers see no other option than to rely on their technology alone, while many researchers have turned to statistics and markup by hand, as in statistical tagging and the manual metadata markup of corpora.[32]
Unfortunately, neither statistics nor markup seems to have a good long-term prospect for building a talking robot. What is needed instead is a computational reconstruction of cognition in terms of interfaces, components, functional flow, data structure, algorithm, database schema, and so on.
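For the data structure and database schema items on this list, a minimal sketch may help. It assumes the proplet format mentioned in footnote 32 and a simplified word-bank-style store in which proplets are filed in token lines keyed by their core value; the class, attribute names, and methods are illustrative, not the DBS implementation:

    from collections import defaultdict

    class WordBank:
        """Toy word bank: proplets (flat attribute-value records) are
        stored in token lines keyed by core value, so that retrieval
        is a direct lookup rather than a tree traversal."""
        def __init__(self):
            self.token_lines = defaultdict(list)

        def store(self, proplet: dict) -> None:
            self.token_lines[proplet["core"]].append(proplet)

        def retrieve(self, core: str) -> list:
            return self.token_lines[core]

    bank = WordBank()
    bank.store({"core": "julia", "cat": "noun", "fnc": "sleep", "prn": 1})
    print(bank.retrieve("julia"))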
This in turn presupposes a theory of how natural language communication works. The crux is that communication includes language production, and language production requires autonomous control for appropriate behavior. Where else should the contents come from that are realized by the speaker as language surfaces?
[31] Consider, for example, Kripke's (1972) celebrated theory of proper names, defined as “rigid designators” in a set-theoretic model structure, and try to convince someone to use it in a practical application.
[32] Markup and statistics are combined by using the manually annotated corpus as the core corpus for the statistical analysis, usually based on HMMs. Markup by hand requires training a large staff to annotate real texts consistently and sensibly. Statistical tagging can handle only a small percentage of the word form types, at low levels of linguistic detail and accuracy (cf. FoCL'99, Sect. 15.5), in part because about 50% of the word form types in a corpus occur no more than once (hapax legomena).
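The hapax figure is easy to check on any corpus; a minimal sketch in plain Python (the file name in the comment is a placeholder):

    from collections import Counter

    def hapax_ratio(text: str) -> float:
        """Fraction of word form types occurring exactly once."""
        counts = Counter(text.lower().split())
        if not counts:
            return 0.0
        hapaxes = sum(1 for n in counts.values() if n == 1)
        return hapaxes / len(counts)

    # On a sizable corpus file the ratio typically comes out near 0.5:
    # with open("corpus.txt") as f:
    #     print(hapax_ratio(f.read()))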
Corpora used for statistical analysis are big. A de facto standard is the British National Corpus
(BNC, 1991-1994) with 100 million running word forms. Corpora marked up by hand are usually
quite small. For example, MASC I (for Manually Annotated SubCorpus I) by Ide et al. (2010) consists
of ca. 82 000 running word forms, which amounts to 0.08% of the BNC. Extensions of MASC I are
intended as the core corpus for a statistical analysis of the American National Corpus (ANC).
An alternative approach is that of probabilistic context-free grammars (PCFGs), which use statistics to produce phrase structure and dependency trees. Examples are the Stanford parser (Klein et al. 2003, et seq.) and the Berkeley parser (Petrov et al. 2007, et seq.). An M.A. thesis at the CLUE currently investigates the possibility of using a PCFG parser as the front end for a broad-coverage DBS system. This experiment is based on reinterpreting the sign-oriented PCFG approach as the hear mode of an agent-oriented approach (Sect. 12.4). For use in DBS, the semantic relations coded indirectly (Sect. 7.1) by the context-free PCFG trees must be translated automatically into sets of proplets. The idea is to test the storage, retrieval, inferencing, language interpretation, and language production capabilities of DBS with the large amounts of data provided by the PCFG parser.
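As a rough sketch of what such a translation might look like (the tree encoding, proplet attributes, and function below are illustrative assumptions, not the thesis code; fnc and arg loosely follow DBS proplet notation for functor and arguments):

    # A toy constituent tree for "Julia sleeps", as nested tuples:
    # (label, children...), with leaves as (POS, word).
    tree = ("S", ("NP", ("NNP", "Julia")), ("VP", ("VBZ", "sleeps")))

    def to_proplets(tree):
        """Flatten a subject-verb tree into proplet-like records.
        The fnc/arg attributes code the functor-argument relation
        directly, instead of leaving it implicit in the tree shape."""
        noun = tree[1][1][1]   # leaf word under NP
        verb = tree[2][1][1]   # leaf word under VP
        return [
            {"sur": noun, "cat": "noun", "fnc": verb},    # noun points to its functor
            {"sur": verb, "cat": "verb", "arg": [noun]},  # verb lists its arguments
        ]

    print(to_proplets(tree))
    # [{'sur': 'Julia', 'cat': 'noun', 'fnc': 'sleeps'},
    #  {'sur': 'sleeps', 'cat': 'verb', 'arg': ['Julia']}]

The point of the flattened format is that the semantic relation becomes directly addressable in each record, rather than having to be recovered from the tree geometry.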