speech recognition, animation, planning, etc., some of which may be
off-the-shelf while others are custom-built. As such systems have to
be deconstructed and reconstructed often, CDM proposes blackboards
as the backbone for such integration. This makes it relatively easy to
change information flow, add or remove computational functionality,
etc., even at runtime, as we have regularly done.
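The role of the blackboard can be illustrated with a minimal publish/subscribe sketch; the class, message types, and modules below are hypothetical, not CDM code. The point is that producers and consumers only know the blackboard, so modules can be added, removed, or rerouted without touching one another.

```python
from collections import defaultdict

class Blackboard:
    """Minimal publish/subscribe blackboard (illustrative sketch, not the CDM implementation)."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, msg_type, callback):
        # Modules register interest in a message type at any time, even at runtime.
        self._subscribers[msg_type].append(callback)

    def post(self, msg_type, payload):
        # Producers post messages without knowing who (if anyone) consumes them.
        for callback in self._subscribers[msg_type]:
            callback(payload)

# Two illustrative modules: a speech recognizer posts partial hypotheses,
# an animation module reacts to them. Either can be swapped out independently.
bb = Blackboard()
bb.subscribe("speech.partial", lambda text: print("animate listening nod for:", text))
bb.post("speech.partial", "how do I get to ...")
```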
As far as dialogue management and turn-taking are concerned,
modular or distributed approaches are scarce. Among the few is the
YTTM (Thórisson, 2002b), a model of multimodal real-time turn-
taking. YTTM proposes that processing related to turn-taking can be
separated, in a particular manner, from the processing of content (i.e.
topic). Echoing the CDM, its architectural approach is distributed
and modular and supports full-duplex multi-layered input analysis
and output generation with natural response times (real-time). One
of the background assumptions behind the approach, reinforced over
time by systems built with it (Thórisson et al., 2008; Jonsdottir, 2008;
Ng-Thow-Hing et al., 2007), is that real-time performance calls for
incremental processing of both interpretation and output generation.
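The separation the YTTM argues for can be illustrated with a minimal sketch; the class names, the simple silence rule, and the threshold below are assumptions, not the YTTM itself. A turn-taking layer makes hold/take-turn decisions incrementally from low-level timing cues, while a content layer incrementally refines its interpretation of the same unfolding input.

```python
import time

class TurnTakingLayer:
    """Decides when to take the turn from timing cues alone (illustrative sketch)."""
    def __init__(self, silence_threshold=0.5):
        self.silence_threshold = silence_threshold   # seconds of silence before responding
        self.last_speech_time = time.monotonic()

    def on_audio_frame(self, speech_detected):
        now = time.monotonic()
        if speech_detected:
            self.last_speech_time = now
            return "hold"                            # the user is still speaking
        if now - self.last_speech_time > self.silence_threshold:
            return "take_turn"                       # silence long enough: respond now
        return "wait"

class ContentLayer:
    """Interprets content incrementally and independently of turn-taking (sketch)."""
    def __init__(self):
        self.partial_interpretation = []

    def on_partial_transcript(self, words):
        self.partial_interpretation.extend(words)    # refine as each word arrives

turn_taking, content = TurnTakingLayer(), ContentLayer()
content.on_partial_transcript(["how", "do", "I", "get", "to"])
print(turn_taking.on_audio_frame(speech_detected=False))  # "wait": not enough silence yet
```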
The J.Jr. system (Thórisson, 1993) was a communicative agent
that could take turns in real-time casual conversation with a human.
It was controlled by a finite-state-machine architecture similar to the
Subsumption Architecture (Brooks, 1986). The system did not process
the content of a user's speech, but instead relied on an analysis of
prosodic information to decide when to ask questions (i.e. take the
turn) and when to interject back-channel feedback.
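A rough sketch of this kind of prosody-driven, finite-state turn-taking control follows; the states, cues, and transitions are invented for illustration and are not taken from J.Jr. itself.

```python
# Illustrative finite-state turn-taking controller driven by prosodic cues only
# (no content analysis); the states, cues, and transitions are made up.

def next_state(state, cue):
    """cue is one of: 'speech', 'short_pause', 'long_pause', 'falling_pitch'."""
    transitions = {
        ("listening", "short_pause"):   ("listening", "backchannel"),   # "mm-hm"
        ("listening", "falling_pitch"): ("maybe_done", None),
        ("maybe_done", "long_pause"):   ("speaking", "ask_question"),   # take the turn
        ("maybe_done", "speech"):       ("listening", None),            # user kept going
        ("speaking", "speech"):         ("listening", "stop_speaking"), # yield on overlap
    }
    return transitions.get((state, cue), (state, None))

state = "listening"
for cue in ["speech", "short_pause", "falling_pitch", "long_pause"]:
    state, action = next_state(state, cue)
    print(cue, "->", state, action)
```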
While modular, J.Jr.'s architecture turned out to be difficult to expand
into a larger, more intelligent architecture (Thórisson, 1996), especially
when confronted with features at different time scales and levels of
abstraction and detail (prosodic, semantic, pragmatic). Subsequent
work on Gandalf (Thórisson, 1996) incorporated mechanisms from J.Jr.
into the Ymir architecture, but presented a much more expandable,
modular system of perception modules, deciders, and action modules
in a holistic architecture that addressed content (interpretation and
generation of meaning) as well as envelope phenomena (process
control). A descendant of this architecture and methodology was
recently used in building an advanced dialogue and planning system
for the Honda ASIMO robot (Ng-Thow-Hing et al., 2007).
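The kind of decomposition this describes, with perception modules feeding deciders that trigger action modules, and with separate paths for content and envelope (process-control) decisions, can be sketched schematically as follows. The module names, cues, and routing are illustrative assumptions, not Ymir's actual implementation.

```python
# Schematic sketch of a perception -> decider -> action pipeline with separate
# content and envelope (process-control) paths; names are illustrative only.

class PerceptionModule:
    def __init__(self, name):
        self.name = name
    def perceive(self, raw_input):
        # e.g. a prosody tracker or a speech recognizer producing a percept
        return {"source": self.name, "data": raw_input}

class Decider:
    def __init__(self, layer):
        self.layer = layer  # "envelope" (turn-taking, gaze) or "content" (what to say)
    def decide(self, percept):
        if self.layer == "envelope":
            return "nod" if percept["data"] == "short_pause" else None
        return "answer_question" if percept["data"] == "question_text" else None

class ActionModule:
    def act(self, decision):
        if decision:
            print("executing:", decision)

# Envelope decisions (fast, shallow cues) and content decisions (slower, deeper
# interpretation) run side by side over percepts from different modules.
prosody = PerceptionModule("prosody_tracker")
recognizer = PerceptionModule("speech_recognizer")
envelope_decider, content_decider = Decider("envelope"), Decider("content")
actuator = ActionModule()

actuator.act(envelope_decider.decide(prosody.perceive("short_pause")))       # executing: nod
actuator.act(content_decider.decide(recognizer.perceive("question_text")))   # executing: answer_question
```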
Raux and Eskenazi (2008) presented data from a corpus analysis
of an online bus scheduling/information system, showing that a
number of dialogue features, including speech act type, can be used
to improve the identification of speech endpoints, given a silence.
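The idea can be sketched roughly as follows; the speech-act labels and threshold values are invented for illustration (Raux and Eskenazi derive theirs from corpus data), but the shape of the decision is the same: the amount of silence needed to declare an endpoint depends on dialogue context rather than being a single fixed timeout.

```python
# Illustrative only: condition the endpointing silence threshold on dialogue
# features such as speech act type, rather than using one fixed timeout.
# The labels and threshold values below are hypothetical.

DEFAULT_THRESHOLD = 0.8  # seconds of silence before declaring the turn finished

THRESHOLD_BY_SPEECH_ACT = {
    "yes_no_answer": 0.3,   # short answers end quickly
    "open_question": 1.0,   # leave room for reformulation
    "backchannel":   0.2,
}

def endpoint_reached(silence_duration, speech_act):
    threshold = THRESHOLD_BY_SPEECH_ACT.get(speech_act, DEFAULT_THRESHOLD)
    return silence_duration >= threshold

print(endpoint_reached(0.4, "yes_no_answer"))  # True: a short silence suffices
print(endpoint_reached(0.4, "open_question"))  # False: wait longer before responding
```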