had something to say. They found that 30% of between-speaker silences
(turn-transitions) were shorter than 200 msecs and 70% shorter than
500 msecs. Within-turn silences, that is, silences where the same person
speaks before and after the silence, are on average around 200 msecs
but can be as long as 1 second, which has been reported to be the
average “silence tolerance” for American-English speakers (Jefferson,
1989); longer silences are thus likely to be interpreted by a listener as a
“turn-giving signal”. 4 Tolerance for silences in dialogue varies greatly
between individuals, ethnic groups, and situations; participants in a
political debate exhibit a considerably shorter silence tolerance than
people in casual conversation—this can further be impacted by social
norms (e.g. relationship of the conversants), information inferable from
the interaction (type of conversation, semantics, etc.), and internal
information (e.g. mood, sense of urgency, etc.). To be on par with
humans in turn-taking efficiency, a system thus needs to be able to
predict, given an observed silence, what the interlocutor intends to do.
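The silence statistics above suggest a naive threshold-based predictor as a baseline. The sketch below is illustrative, not the authors' model: the function name and the exact cutoffs (200 ms for typical within-turn pauses, 1 s for the reported average silence tolerance) are assumptions drawn loosely from the figures cited in the text.

```python
def classify_silence(silence_ms: float) -> str:
    """Naive silence-duration predictor of interlocutor intent.

    Thresholds are illustrative, loosely based on the text:
    within-turn silences average around 200 ms, and ~1 s is the
    reported average silence tolerance for American-English
    speakers (Jefferson, 1989).
    """
    if silence_ms < 200:
        return "hold"        # likely a within-turn pause; keep listening
    elif silence_ms < 1000:
        return "ambiguous"   # could be a pause or a turn yield
    else:
        return "take-turn"   # past typical tolerance; treat as turn-giving

print(classify_silence(150))   # → hold
print(classify_silence(1200))  # → take-turn
```

A fixed-threshold rule like this cannot match human performance precisely because, as the text notes, silence tolerance varies across individuals and situations, which motivates the learning approach described below.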
The motivation for the present work is to develop a complete
conversational agent that can learn to interact and adapt its interaction
behavior to its conversational partners, in a short amount of time. The
agent may not know a lot about any particular topic of discussion,
but it would be an “expert dialoguer”, whose topic knowledge could
be expanded as needed for various applications and as permitted by
the artificial intelligence techniques under the hood. The Ymir Turn-
Taking Model (YTTM) (Thórisson, 2002b) proposes a
framework for separating envelope control from topic control, making
such an approach tractable. As a first step in this endeavour we are
targeting a cooperative agent that can take turns, ideally with no
speech overlap, yet achieves the shortest possible silence duration
between speaker turns. Our approach is intended to achieve four
key goals. First, we want to use on-line open-mic and natural speech
when communicating with the system, integrating continuous acoustic
perceptions as a basis for decision making. We do not want to assume
that the human must change their speech style or approach the system
any differently than they do another human they might talk to. Second,
we want to model turn-taking at a higher level of detail than previous
attempts have done by including incremental perception and generation
in a unified way. Third, because of the high individual variability in
interaction style and pace, we want to incorporate learning from the
outset, allowing the system to adapt to every person it interacts with
4 “Turn-giving signals” are in quotes because they are not true “signals” in the
engineering sense of the term, but rather socially conditioned “contexts”—
combinations of features which together constitute “polite”, “improper”, “rude”,
or otherwise connotated contexts for the interlocutors' behaviors.
 