the lack of implemented systems intended to manage dynamic open-
microphone/full-duplex dialogue (cf. Moore, 2007; Allen et al., 2001;
Raux and Eskenazi, 2007), where the system is sufficiently aware of
when it is given the turn, and can be naturally interrupted at any
point in time by the human, and vice versa.
Although syntax, semantics, and pragmatics can indisputably play
a large role in the dynamics of turn-taking, we have argued elsewhere
that natural turn-taking is partially driven by a content-free planning
system 2 (Thórisson, 2002b). People rely on signals and contextual
cues that from the vantage point of humans are fairly primitive,
e.g. prosody, speech loudness, gaze direction, facial expressions,
etc. (Goodwin, 1981). In humans, recognition of prosodic patterns,
based on the timing of speech loudness, silences, and intonation, is
a more lightweight process than word recognition or syntactic or
semantic processing (Card et al., 1986). The difference in processing
load between semantic processing and contextual/turn-signal processing
is even more pronounced for artificial perception (the former being
more computationally intensive than the latter), and therefore such
cues are prime candidates for inclusion in the process of recognizing
turn signals in artificial dialogue systems. While in the future we intend
to address the full scope of human turn management contextual cues,
at present even these obvious ones present challenges to architectural
and system design for real-time performance that must be overcome,
and are therefore continuously addressed in our work.
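As a rough illustration of why such timing-based cues are cheap to compute, the following Python sketch segments a stream of per-frame loudness values into speech and silence runs with their durations. It is only a sketch under stated assumptions, not the system described here: the 10 ms frame size, the -40 dB silence threshold, and the names `Segment`, `segment_by_loudness`, and `silences_longer_than` are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Segment:
    kind: str          # "speech" or "silence"
    start_ms: int      # onset of the run, in milliseconds
    duration_ms: int   # length of the run, in milliseconds


def segment_by_loudness(
    loudness_db: Sequence[float],
    frame_ms: int = 10,                  # assumed frame size
    silence_threshold_db: float = -40.0,  # assumed loudness threshold
) -> List[Segment]:
    """Group consecutive frames into speech/silence runs using loudness only."""
    segments: List[Segment] = []
    for i, level in enumerate(loudness_db):
        kind = "speech" if level > silence_threshold_db else "silence"
        if segments and segments[-1].kind == kind:
            # Same kind as the previous frame: extend the current run.
            segments[-1].duration_ms += frame_ms
        else:
            # Kind changed (or first frame): start a new run.
            segments.append(Segment(kind, i * frame_ms, frame_ms))
    return segments


def silences_longer_than(segments: Sequence[Segment], min_ms: int) -> List[Segment]:
    """Return the silence runs lasting at least min_ms milliseconds."""
    return [s for s in segments if s.kind == "silence" and s.duration_ms >= min_ms]


if __name__ == "__main__":
    # Synthetic loudness track: speech, short pause, speech, long silence.
    frames = [-20.0] * 30 + [-60.0] * 12 + [-18.0] * 20 + [-65.0] * 40
    for seg in segment_by_loudness(frames):
        print(seg)
```

The pass is linear in the number of frames and involves no word recognition. Silence duration alone would not separate mid-sentence pauses from end-of-utterance silences; a working system would presumably have to combine it with intonation and the other cues listed above.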
In natural interactions, mid-sentence pauses are a frequent
occurrence. Humans have little difficulty distinguishing these from
proper end-of-utterance silences, 3 and reliably determine the time at
which it is appropriate to take the turn, even on the phone, when no
visual information is available. Temporal analysis of conversational
behaviors in human discourse shows that turn transitions in natural
face-to-face conversations most commonly take between 0 and 250 msecs
(Stivers, 2009; Wilson and Wilson, 2005; Ford and Thompson, 1996;
Goodwin, 1981). Silences in telephone conversations, where visual cues
are absent, are at least 100 msecs longer on average
(Bosch et al., 2005). In a study by Wilson and Wilson (2005), response
time is measured in a face-to-face scenario where both parties always
2 We use the term “planning” in the most general sense, referring to any system that
makes a priori decisions about what should happen before they are put into action. By
“content-free” we mean, in short, virtually without consideration for the particular
dialogue topic of a conversation.
3 Silences are often not needed to signal end-of-turn in free-form human dialogue
because the interlocutor derives it from other cues, such as prosody and content,
often resulting in zero silence between turns (Goodwin, 1981).
 