3.1 Model-based gesture generation
The first systems to address the challenge of iconic gesture
generation were lexicon-based. In general, these systems
were characterized by a straightforward mapping of meaning onto
gesture form. The Behavior Expression Animation Toolkit (BEAT) was
among the first of a new generation of toolkits to generate
synthetic speech together with synchronized nonverbal behaviors, such
as hand gestures and facial displays, realized by an animated
human figure (Cassell et al., 2001). This mapping of text
onto multimodal behavior was based on representations of linguistic
and social context and on behavior generation rules derived from
empirical results. A similar approach was taken with the Nonverbal
Behavior Generator (NVBG), proposed by Lee and Marsella (2006). The
system analyzes the syntactic and semantic structure of surface texts
and takes the affective state of the embodied agent into account to
generate appropriate nonverbal behaviors. Based on a study from the
literature and a video analysis of emotional dialogues, the authors
developed a list of nonverbal behavior generation rules. The Real
Estate Agent (REA) is a more elaborate system, as it aims to model
the bidirectional process of communication (Cassell, 2000). That is,
in addition to generating nonverbal behaviors, the system also
seeks to understand how a human interlocutor uses these same
modalities. The focus of gesture generation in the REA system is the
context-dependent coordination of (lexicalized) gestures with speech,
accounting for the fact that gestures do not always carry the same
meaning as speech.
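To make the lexicon-based scheme concrete, the following minimal Python sketch shows a keyword-triggered mapping from semantic tags to gesture entries. All names, rules, and lexicon entries are invented for illustration; they are not taken from BEAT, NVBG, or REA.

    # Hypothetical sketch of a lexicon-based gesture mapper, loosely in the
    # spirit of rule-based systems such as BEAT or NVBG (all names invented).

    GESTURE_LEXICON = {
        "affirmation": {"gesture": "head_nod", "hands": None},
        "negation":    {"gesture": "head_shake", "hands": None},
        "enumeration": {"gesture": "beat", "hands": "right"},
        "large_size":  {"gesture": "two_hands_apart", "hands": "both"},
    }

    # Each rule pairs a trigger word with a semantic tag in the lexicon.
    RULES = [
        ("yes", "affirmation"),
        ("not", "negation"),
        ("first", "enumeration"),
        ("huge", "large_size"),
    ]

    def generate_behaviors(utterance):
        """Return gesture specs aligned with the words that triggered them."""
        behaviors = []
        for index, word in enumerate(utterance.lower().split()):
            for trigger, tag in RULES:
                if word.startswith(trigger):
                    behaviors.append(dict(GESTURE_LEXICON[tag], word_index=index))
        return behaviors

    print(generate_behaviors("Yes, the first room is huge"))
    # gestures for 'Yes' (head nod), 'first' (beat), and 'huge' (size gesture)

The fixed lexicon is exactly the limitation discussed next: a meaning without an entry simply produces no gesture.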
Relying on empirical results, the systems mentioned so far focus
on the context-dependent coordination of gestures with concurrent
speech, whereby gestures are drawn from a lexicon. The flexibility and
generative power of such gestures to express new content are therefore
very limited. A different approach, closely related to
the generation of speech-accompanying gestures in a spatial domain,
is Huenerfauth's (2008) system, which translates English texts into
American Sign Language (ASL) with a focus on classifier predicates,
complex and descriptive types of ASL sentences. These classifier
predicates share several similarities with iconic gestures accompanying
speech. The system likewise relies on a library of prototypical templates,
one for each type of classifier predicate, whose missing parameters are
filled in according to the particular context.
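The template-filling step can be sketched in the same hypothetical style: a prototype per predicate type fixes some parameters (here the handshape) and leaves others open, to be supplied by the spatial context. None of the names or parameter choices below are taken from Huenerfauth's system; they are assumptions for illustration.

    from dataclasses import dataclass, replace

    @dataclass
    class ClassifierTemplate:
        """Prototype for one classifier predicate type; None marks open slots."""
        predicate_type: str
        handshape: str           # fixed by the predicate type
        location: tuple = None   # filled from the spatial context
        movement: str = None     # filled from the described event

    # Hypothetical prototype: a vehicle moving along a path.
    VEHICLE_PATH = ClassifierTemplate("vehicle_path", handshape="3-hand")

    def instantiate(template, scene):
        """Fill the template's open parameters from the current scene model."""
        return replace(template,
                       location=scene["start_position"],
                       movement=scene["path_shape"])

    scene = {"start_position": (0.2, 0.0, 0.5), "path_shape": "arc_left"}
    print(instantiate(VEHICLE_PATH, scene))
    # ClassifierTemplate(predicate_type='vehicle_path', handshape='3-hand',
    #                    location=(0.2, 0.0, 0.5), movement='arc_left')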
The NUMACK system (Kopp et al., 2007) tries to overcome
the limitations of lexicon-based gesture generation by considering
patterns of human gesture composition. Based on empirical results,