nonverbal behavior. Figure 5 presents an example of contextual
prediction for the listener's backchannel.
The goal of a prediction model is to create online predictions of
human nonverbal behaviors based on external contextual information.
The prediction model automatically learns which contextual features
are important and how they affect the timing of nonverbal behaviors.
This goal is achieved by using a machine learning approach wherein
a sequential probabilistic model is trained using a database of human
interactions. A sequential probabilistic model takes as input a sequence
of observation features (e.g., the speaker's features) and returns a
sequence of probabilities (e.g., of the listener's backchannel). Some of
the most popular sequential models are the Hidden Markov Model
(HMM) (Rabiner, 1989) and the Conditional Random Field (CRF)
(Lafferty et al., 2001). One main difference between these two models
is that the CRF is discriminative (i.e., it learns to differentiate the
cases where the human/agent produces a particular behavior from the
cases where it does not), while the HMM is generative (i.e., it models
only the cases where the human/agent produces the behavior, without
taking into account the cases where no such behavior occurs). The
contextual prediction framework is designed to work with any type of
sequential probabilistic model.
At the core of the approach is the idea of context, the set of external
factors which can potentially influence a person's nonverbal behavior.
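To make this concrete, here is a minimal sketch of the pipeline in
Python using the sklearn-crfsuite package; the feature names ("pause",
"pitch_slope", "head_nod"), the labels and the toy data are illustrative
assumptions, not the exact setup of the original work.

import sklearn_crfsuite

# Each training example is one interaction: a sequence of per-frame
# observation dicts (the speaker's features) paired with a sequence of
# labels (whether the listener produced a backchannel at that frame).
X_train = [
    [  # one dyadic interaction, three frames (hypothetical values)
        {"pause": True,  "pitch_slope": -0.4, "head_nod": False},
        {"pause": False, "pitch_slope":  0.1, "head_nod": False},
        {"pause": True,  "pitch_slope": -0.7, "head_nod": True},
    ],
]
y_train = [["none", "none", "backchannel"]]

# Train a discriminative sequential model: the CRF learns which
# contextual features matter and how they affect backchannel timing.
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                           max_iterations=100)
crf.fit(X_train, y_train)

# At prediction time the model returns a sequence of probabilities:
# for each frame, the marginal probability of each label.
marginals = crf.predict_marginals(X_train)
for frame in marginals[0]:
    print(frame["backchannel"])  # P(listener backchannels | speaker context)

Because the framework is model-agnostic, the CRF here could in
principle be replaced by an HMM or another sequential probabilistic
model without changing the surrounding pipeline.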
4.2 Context (shallow features)
Conceptually, context includes all verbal and nonverbal behaviors
performed by other participants (human, robot, computer or virtual
human) as well as the description of the interaction environment.
For a dyadic interaction between two humans (as shown in Figure
5), the context used to predict the listener's nonverbal behavior will
include the speaker's verbal and nonverbal behaviors, such as
head movements, eye gaze, pauses and prosodic features. What
differentiates the computational framework from “conventional”
multimodal approaches (e.g., audio-visual speech recognition) is that
the influence of other participants (and the environment) on a person's
nonverbal behavior is modeled directly instead of only modeling
signals from the same source (e.g., the listener in Figure 5).
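As a sketch of what this direct modeling looks like in practice, the
fragment below (Python; all signal names are hypothetical) assembles
the per-frame context for the listener's prediction model entirely from
the speaker's channels, rather than from the listener's own signals.

from typing import Dict, List

def build_context_sequence(
    head_nods: List[bool],          # speaker head-movement detector output
    gazes_at_listener: List[bool],  # speaker eye-gaze estimate
    pauses: List[bool],             # speaker pause detector output
    pitch_slopes: List[float],      # speaker prosodic feature, per frame
) -> List[Dict[str, object]]:
    # Align the speaker's channels frame by frame: the context is built
    # from the other participant's signals, not from the listener's own.
    return [
        {
            "head_nod": nod,
            "gaze_at_listener": gaze,
            "pause": pause,
            "pitch_slope": slope,
        }
        for nod, gaze, pause, slope in zip(
            head_nods, gazes_at_listener, pauses, pitch_slopes
        )
    ]

The resulting sequence of feature dictionaries can be fed directly to a
sequential model such as the CRF sketched earlier.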
In previous work, four types of contextual features were highlighted:
lexical features, prosody and punctuation features, timing information
and gesture displays. Such features were used to recognize human
nonverbal gestures when the robot spoke to a human, or to generate
a gesture for a virtual human given a human's verbal and nonverbal