authors tested their findings in a real-time system: Using information
about dialogue structure—speech act classes, a measure of semantic
completeness, and the probability distribution of utterance durations
(but not prosody)—the system reduced turn-taking latency by as
much as 50% in some cases, but by significantly less in others. This work
reported no benefits from prosody for this purpose, which is surprising
given that many studies have shown the opposite to be true (cf. Gratch
et al., 2006; Schlangen, 2006; Thórisson, 1996; Traum and Heeman, 1996;
Pierrehumbert and Hirschberg, 1990; Goodwin, 1981). We suspect one
reason could be that the pitch and intensity extraction methods they
used did not work very well on the data selected for analysis. Prosodic
information has successfully been used to determine backchannel
feedback in real time. The Rapport Agent (Gratch et al., 2006) uses
gaze, posture, and prosodic perception, among other things, to detect
backchannel opportunities. The Ymir/Gandalf system (Thórisson,
1996) also used prosody, adding analysis of semantic, syntactic (and to
a small extent even pragmatic) completeness to determine turn-taking
behaviors. Unfortunately, evaluations of its benefit, for the purpose
of turn-taking per se, are not available. The major lesson that can be
learned from Raux and Eskenazi, echoing the work on Gandalf, is that
turn-taking can be improved through an integrated, coordinated use
of various features in context.
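As a rough illustration of the kind of prosody-driven backchannel detection systems such as the Rapport Agent perform, the sketch below flags a falling-pitch, low-energy tail of a speech region as a backchannel opportunity. The function name, frame features, and thresholds are all invented for illustration; real systems fuse many more cues (gaze, posture) and use proper pitch trackers rather than precomputed values.

```python
# Illustrative sketch only: a toy backchannel-opportunity detector based
# on two prosodic cues, falling pitch and fading energy. All thresholds
# and feature names are hypothetical, not taken from the cited systems.

def backchannel_opportunity(pitch_hz, energy, pitch_drop=20.0, energy_floor=0.2):
    """Return True if the final frames show falling pitch and near-silence.

    pitch_hz, energy: per-frame prosodic estimates for one speech region.
    pitch_drop: minimum Hz decline over the last four frames (assumed).
    energy_floor: energy level treated as near-silence (assumed).
    """
    if len(pitch_hz) < 4:
        return False
    tail = pitch_hz[-4:]
    falling = tail[0] - tail[-1] >= pitch_drop   # pitch declines at the end
    quiet = energy[-1] <= energy_floor           # energy fades toward silence
    return falling and quiet

# Toy frames: pitch falls about 25 Hz while energy fades out.
pitch = [180, 178, 175, 170, 160, 150]
energy = [0.8, 0.7, 0.6, 0.4, 0.3, 0.1]
print(backchannel_opportunity(pitch, energy))  # True on this toy input
```

In a running system these per-frame values would come from a real-time pitch and intensity tracker, and the decision would be combined with the other contextual features discussed above rather than taken on its own.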
The problem of utterance segmenting for the purpose of proper
turn-taking has been addressed to some extent in prior work. Of all
the data sources informing dialogue participants about the state of the
dialogue, prosody is the most prominent among the non-semantic ones.
From the prior work reviewed, this seems like the most obvious place
to start when attempting to design turn-taking mechanisms. Sato et al.
(2002) use a decision tree to classify when silence signals that a turn
should be taken. They annotated various features in a large corpus
of human-human conversation to train and test the tree. The results
show that semantic and syntactic categories, as well as understanding,
are the most important features. These experiments have so far been
limited to annotated data from a single, task-oriented domain. Applying
their methods to casual real-time conversation with today's speech
recognition methods would inevitably push recognition time beyond
any practical use because of the larger vocabulary: the content-interpretation
results simply could not be produced quickly and reliably
enough to support turn-taking decisions at sub-second speeds
(Jonsdottir et al., 2007).
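The flavor of a learned classifier like Sato et al.'s can be caricatured with a minimal hand-coded decision tree over silence features. The features and split thresholds below are invented for illustration; the original tree was trained on a large annotated human-human corpus, with semantic and syntactic completeness emerging as the most informative splits.

```python
# A minimal hand-coded stand-in for a learned turn-taking decision tree:
# given features of a silence, decide whether to take the turn (True) or
# treat it as a within-utterance pause (False). Thresholds are assumed,
# not taken from Sato et al. (2002).

def take_turn(semantically_complete, syntactically_complete, silence_ms):
    """Classify a silence as a turn-taking opportunity or a pause."""
    if semantically_complete:            # root split: semantic completeness
        return True                      # utterance complete: safe to respond
    if syntactically_complete:           # fall back on syntactic completeness
        return silence_ms > 800          # respond only after a long silence
    return silence_ms > 1500             # incomplete speech: wait even longer

print(take_turn(True, True, 200))    # True  (utterance is complete)
print(take_turn(False, True, 500))   # False (short pause mid-utterance)
```

The point of the caricature is the structure, not the numbers: completeness features dominate the decision, with silence duration acting only as a tie-breaker, which matches the feature-importance ranking the authors report.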
The introduction of learning into a dialogue system gives its
designers yet another complex dimension which can affect everything