An Emotional Talking Head for
a Humoristic Chatbot
Agnese Augello 1 , Orazio Gambino 1 , Vincenzo Cannella 1 , Roberto Pirrone 1 ,
Salvatore Gaglio 1 and Giovanni Pilato 2
1 DICGIM - University of Palermo, Palermo
2 ICAR - Italian National Research Council, Palermo
Italy
1. Introduction
Interest in enhancing the interface usability of applications and entertainment
platforms has increased in recent years. Research in human-computer interaction on
conversational agents, also known as chatbots, and on natural language dialogue systems equipped
with audio-video interfaces has grown as well. One of the most pursued goals is to
enhance the realism of interaction with such systems. For this reason they are provided with
catchy interfaces featuring humanlike avatars capable of adapting their behavior according to the
content of the conversation. These agents can interact vocally with users by means of Automatic
Speech Recognition (ASR) and Text To Speech (TTS) systems; moreover, they can change their
“emotions” according to the sentences entered by the user. In this framework, the visual
aspect of interaction also plays a key role, leading to systems
capable of performing speech synchronization with an animated face model. Systems of this
kind are called Talking Heads.
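A minimal sketch of such an interaction loop is given below; every component name
(asr, chatbot, emotions, tts, face) is a hypothetical placeholder standing in for the ASR,
dialogue, emotion classification, TTS, and face animation modules, and the code illustrates
the general architecture rather than the implementation of any of the cited systems.

class TalkingHeadAgent:
    """Illustrative wiring of ASR, chatbot, emotion classification,
    TTS and face animation; all components are hypothetical."""

    def __init__(self, asr, chatbot, emotions, tts, face):
        self.asr = asr            # speech -> text
        self.chatbot = chatbot    # user text -> reply text
        self.emotions = emotions  # sentences -> emotion label
        self.tts = tts            # text -> (audio, phoneme timing)
        self.face = face          # animated face model

    def interact(self, user_audio):
        user_text = self.asr.transcribe(user_audio)    # ASR step
        reply = self.chatbot.respond(user_text)        # dialogue step
        emotion = self.emotions.classify(user_text)    # emotional state
        audio, phonemes = self.tts.synthesize(reply)   # TTS step
        self.face.set_expression(emotion)              # adapt the avatar
        self.face.animate_mouth(phonemes)              # lip synchronization
        return audio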
Several implementations of talking heads are reported in the literature. Facial movements are
simulated by rational free-form deformations in the 3D talking head developed by Kalra et al.
(2006). A Cyberware scanner is used to acquire the surface of a human face in Lee et al. (1995).
The surface is then converted into a triangle mesh by means of image analysis techniques
aimed at finding local minima and maxima of reflectance.
In Waters et al. (1994) the DECface system is presented. In this work, the animation of a
wireframe face model is synchronized with an audio stream provided by a TTS system. An
input ASCII text is converted into a phonetic transcription, and a speech synthesizer generates
an audio stream. The audio server is queried to determine the phoneme currently being
played, and the shape of the mouth is computed from the trajectories of its main vertices. In
this way, the audio samples are synchronized with the graphics. A nonlinear function controls
the translation of the polygonal vertices so as to simulate the mouth movements.
Synchronization is achieved by computing the deformation length of the mouth, based on the
duration of a group of audio samples.
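The phoneme-driven mouth animation just described can be sketched as follows; the
per-phoneme mouth openings, the cosine easing profile, and the frame rate are illustrative
assumptions and are not taken from Waters et al. (1994).

import math

# Illustrative target mouth opening per phoneme (assumed values).
MOUTH_OPENING = {"a": 1.0, "o": 0.8, "e": 0.6, "m": 0.05, "sil": 0.0}

def ease(t):
    # Nonlinear function controlling vertex translation:
    # a smooth ease-in/ease-out profile instead of a linear ramp.
    return 0.5 - 0.5 * math.cos(math.pi * t)

def mouth_keyframes(phonemes, fps=25):
    """phonemes: list of (symbol, duration_in_seconds) pairs, as could be
    obtained by querying a TTS engine. Returns one mouth-opening value per
    video frame, so that the deformation of the mouth spans the duration
    of the corresponding group of audio samples."""
    frames = []
    previous = 0.0
    for symbol, duration in phonemes:
        target = MOUTH_OPENING.get(symbol, 0.3)
        n_frames = max(1, round(duration * fps))
        for i in range(n_frames):
            t = (i + 1) / n_frames
            frames.append(previous + (target - previous) * ease(t))
        previous = target
    return frames

# Example: "ma" followed by silence, with hypothetical phoneme durations.
print(mouth_keyframes([("m", 0.08), ("a", 0.20), ("sil", 0.10)]))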
BEAT (Behavior Expression Animation Toolkit), an intelligent agent with human
characteristics controlled by an input text, is presented in Cassell et al. (2001). A talking
head for the Web with a client-server architecture is described in Ostermann et al. (2000).
The client application comprises the browser, the TTS engine, and the animation renderer.