developed an architecture with integrated machine learning, allowing
the system to automatically acquire proper turn-taking behavior. The
system learns cooperative (“polite”) turn-taking in real-time by talking
to humans via Skype. Results show performance to be close to that of
humans, as found in naturally occurring dialogue, with 20% of the turn
transitions taking place in under 300 milliseconds (msecs) and 50%
under 500 msecs. Key contributions of this work are the methods for
constructing more capable dialogue systems with an increasing number
of integrated features, implementation of adaptivity for turn-taking,
and a firmer theoretical ground on which to build holistic dialogue
architectures.
As many have argued, turn-taking is a fundamental and necessary
mechanism for real-time verbal (and extraverbal) information exchange,
and should, in our opinion, be one of the key focus areas for those
interested in building complete artificial dialogue systems. Turn-taking
skills include minimizing overlaps, minimizing silences, giving proper
back-channel feedback, barge-in techniques, and other behaviors
that most people handle fluidly and with ease. People use various
multimodal behaviors including intonation and gaze, for example,
to signal that they have finished speaking and are expecting a reply
(Goodwin, 1981). Based on continuously streaming information from our
sensory organs, most of us pick up on such signals without difficulty,
infer the current state of the dialogue and what the other participants
intend, and then automatically produce multimodal behavior in real
time that achieves the goals of the dialogue. In amicable conversations,
participants usually share the goal of cooperation. Turn exchange—a
negotiation-based activity based on the massive historical training
(“socialization”) of the participants—usually proceeds so smoothly
that people do not even realize the degree of complexity inherent in
the processes responsible for making it happen.
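The mapping from perceived cues to a turn-taking decision can be caricatured as a thresholded policy: take the turn quickly when a turn-yielding signal (such as falling final pitch) accompanies a short silence, and wait longer otherwise to avoid overlap. The sketch below is purely illustrative—the frame fields, threshold values, and function names are our assumptions, not the system described in the text—but it shows the flavor of such a decision rule, with the thresholds loosely echoing the 300/500 ms transition times reported above.

```python
# Illustrative sketch only: a thresholded end-of-turn decision combining
# silence duration with a crude prosodic cue. All names and thresholds
# here are hypothetical, not the architecture described in the text.

from dataclasses import dataclass


@dataclass
class DialogueFrame:
    """One perception update from the sensing layer (fields are assumed)."""
    silence_ms: float           # time since the other party last spoke
    final_pitch_falling: bool   # stand-in for an intonation-based cue


def should_take_turn(frame: DialogueFrame,
                     fast_ms: float = 300.0,
                     slow_ms: float = 500.0) -> bool:
    """Take the turn fast when a turn-yielding cue accompanies the
    silence; otherwise wait longer to minimize the risk of overlap."""
    if frame.final_pitch_falling:
        return frame.silence_ms >= fast_ms
    return frame.silence_ms >= slow_ms


# A falling-pitch cue licenses a fast transition after ~300 ms of
# silence; without the cue the policy waits the full ~500 ms.
print(should_take_turn(DialogueFrame(320.0, True)))   # True
print(should_take_turn(DialogueFrame(320.0, False)))  # False
```

A real system would of course fuse many more cues (gaze, syntax, semantics) and adapt the thresholds online, which is precisely the learning problem the architecture discussed here addresses.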
The challenge of endowing synthetic agents with such skills
lies not only in the integration of perception and action in sensible
planning schemes but especially in the fact that these have to be
tightly coordinated while marching to a real-world clock. How easy
or difficult this is depends on the architectural framework in which
the mechanisms are implemented, which is a prime reason for the
broad overview of our dialogue architecture that we give here.
In spite of recent progress in speech synthesis and recognition,
a lack of temporal responsiveness remains one of the key shortcomings
that clearly set current dialogue systems apart from humans; speech
recognition systems that have been in development for over two
decades are still far from addressing the needs of real-time dynamic
dialogue (Jonsdottir et al., 2007). Many researchers have pointed out