ted by natural language (Chaps. 10, 11). In DBS, the computation of these
perspectives is based on (i) suitable inferences and (ii) the values of the
agent's STAR parameters, for Space, Time, Agent, Recipient.
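A minimal sketch of what such a STAR record might look like, assuming simple string-valued fields (the class name and field types are illustrative, not part of the DBS definition):

    from dataclasses import dataclass

    @dataclass
    class STAR:
        space: str      # where the utterance is anchored
        time: str       # when the utterance is anchored
        agent: str      # who produces or interprets the content
        recipient: str  # to whom the content is addressed

    # Example anchoring of a content for perspective computation:
    star = STAR(space="kitchen", time="2010-06-01T09:00", agent="speaker", recipient="hearer")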
The traditions from Structuralism to Nativism in linguistics and from Symbolic Logic to Truth-Conditional Semantics in analytic philosophy have not gotten around to addressing these points, let alone coming up with their own cogent model of natural language communication suitable for a talking robot. It is therefore little wonder that engineers and researchers trying to satisfy an ever-growing demand for applications in human-machine communication are turning in droves to other approaches. Having long given up on expecting efficient algorithms and sensible models[31] from linguistics and philosophy, the engineers see no other option than to rely on their technology alone, while many researchers have turned to statistics and markup by hand, as in statistical tagging and the manual metadata markup of corpora.[32]
Unfortunately, neither statistics nor markup seems to have a good long-term prospect for building a talking robot. What is needed instead is a computational reconstruction of cognition in terms of interfaces, components, functional flow, data structure, algorithm, database schema, and so on.
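For the data structure and database schema items on this list, a minimal sketch may help. It assumes the proplet format mentioned in footnote 32 and a simplified word-bank-style store in which proplets are filed in token lines keyed by their core value; the class, attribute names, and methods are illustrative, not the DBS implementation:

    from collections import defaultdict

    class WordBank:
        """Toy word bank: proplets (flat attribute-value records) are
        stored in token lines keyed by core value, so that retrieval
        is a direct lookup rather than a tree traversal."""
        def __init__(self):
            self.token_lines = defaultdict(list)

        def store(self, proplet: dict) -> None:
            self.token_lines[proplet["core"]].append(proplet)

        def retrieve(self, core: str) -> list:
            return self.token_lines[core]

    bank = WordBank()
    bank.store({"core": "julia", "cat": "noun", "fnc": "sleep", "prn": 1})
    print(bank.retrieve("julia"))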
This in turn presupposes a theory of how natural language communication works. The crux is that communication includes language production, and language production requires autonomous control for appropriate behavior. Where else should the contents come from that are realized by the speaker as language surfaces?
[31] Consider, for example, Kripke's (1972) celebrated theory of proper names, defined as “rigid designators” in a set-theoretic model structure, and try to convince someone to use it in a practical application.
[32] Markup and statistics are combined by using the manually annotated corpus as the core corpus for the statistical analysis, usually based on HMMs. Markup by hand requires training a large staff to annotate real texts consistently and sensibly. Statistical tagging can handle only a small percentage of the word form types, at low levels of linguistic detail and accuracy (cf. FoCL'99, Sect. 15.5), in part because about 50% of the word form types in a corpus occur no more than once (hapax legomena).
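The hapax figure is easy to check on any corpus; a minimal sketch in plain Python (the file name in the comment is a placeholder):

    from collections import Counter

    def hapax_ratio(text: str) -> float:
        """Fraction of word form types occurring exactly once."""
        counts = Counter(text.lower().split())
        if not counts:
            return 0.0
        hapaxes = sum(1 for n in counts.values() if n == 1)
        return hapaxes / len(counts)

    # On a sizable corpus file the ratio typically comes out near 0.5:
    # with open("corpus.txt") as f:
    #     print(hapax_ratio(f.read()))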
Corpora used for statistical analysis are big. A de facto standard is the British National Corpus
(BNC, 1991-1994) with 100 million running word forms. Corpora marked up by hand are usually
quite small. For example, MASC I (for Manually Annotated SubCorpus I) by Ide et al. (2010) consists
of ca. 82 000 running word forms, which amounts to 0.08% of the BNC. Extensions of MASC I are
intended as the core corpus for a statistical analysis of the American National Corpus (ANC).
An alternative approach is that of probabilistic context-free grammars (PCFGs), which use statistics to produce phrase structure and dependency trees. Examples are the Stanford parser (Klein et al. 2003, et seq.) and the Berkeley parser (Petrov et al. 2007, et seq.). An M.A. thesis at the CLUE currently investigates the possibility of using a PCFG parser as the front end for a broad-coverage DBS system. This experiment is based on reinterpreting the sign-oriented PCFG approach as the hear mode of an agent-oriented approach (Sect. 12.4). For use in DBS, the semantic relations coded indirectly (Sect. 7.1) by the context-free PCFG trees must be translated automatically into sets of proplets. The idea is to test the storage, retrieval, inferencing, language interpretation, and language production capabilities of DBS with the large amounts of data provided by the PCFG parser.
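As a rough sketch of what such a translation might look like (the tree encoding, proplet attributes, and function below are illustrative assumptions, not the thesis code; fnc and arg loosely follow DBS proplet notation for functor and arguments):

    # A toy constituent tree for "Julia sleeps", as nested tuples:
    # (label, children...), with leaves as (POS, word).
    tree = ("S", ("NP", ("NNP", "Julia")), ("VP", ("VBZ", "sleeps")))

    def to_proplets(tree):
        """Flatten a subject-verb tree into proplet-like records.
        The fnc/arg attributes code the functor-argument relation
        directly, instead of leaving it implicit in the tree shape."""
        noun = tree[1][1][1]   # leaf word under NP
        verb = tree[2][1][1]   # leaf word under VP
        return [
            {"sur": noun, "cat": "noun", "fnc": verb},    # noun points to its functor
            {"sur": verb, "cat": "verb", "arg": [noun]},  # verb lists its arguments
        ]

    print(to_proplets(tree))
    # [{'sur': 'Julia', 'cat': 'noun', 'fnc': 'sleeps'},
    #  {'sur': 'sleeps', 'cat': 'verb', 'arg': ['Julia']}]

The point of the flattened format is that the semantic relation becomes directly addressable in each record, rather than having to be recovered from the tree geometry.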