contributions in an interaction (Morency et al., 2008).
Shallow versions of each of these features were either computed automatically or manually annotated, drawn from the dialogue manager of the robot (or virtual human) or directly from the human's actions.
For lexical features, individual words (unigrams) and word pairs (bigrams) provided information about the likelihood of a gestural reaction.
A range of techniques was used for prosodic features. Using the Aizula system (Ward and Tsukahara, 2000), pitch, intensity, and other prosodic features were computed automatically from the human's speech (Morency et al., 2008). With robots and virtual humans, the generated punctuation was used to approximate prosodic cues such as pauses and interrogative utterances.
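Pitch and intensity tracks of this kind can be computed with off-the-shelf tools; the sketch below uses librosa and only approximates such pipelines (the thresholds and feature choices are ours, not the Aizula system's actual implementation):

```python
# Rough sketch of automatic prosodic feature extraction using librosa.
# Assumes the clip contains voiced speech; thresholds are illustrative
# and this is not the pipeline used by the systems cited above.
import numpy as np
import librosa

def prosodic_features(wav_path: str) -> dict:
    y, sr = librosa.load(wav_path, sr=16000)
    # Fundamental frequency (pitch) track; NaN where a frame is unvoiced.
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr)
    # Frame-level RMS energy as a proxy for intensity.
    rms = librosa.feature.rms(y=y)[0]
    return {
        "mean_pitch_hz": float(np.nanmean(f0)),
        "pitch_range_hz": float(np.nanmax(f0) - np.nanmin(f0)),
        "mean_intensity": float(rms.mean()),
        # A low-energy tail approximates a trailing pause, a common
        # backchannel-inviting cue.
        "ends_in_pause": bool(rms[-10:].mean() < 0.01),
    }
```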
Synthesized visual gestures are a key capability of robots and virtual humans, and they can also be leveraged as a context cue for gesture interpretation. The gestures expressed by the ECA influence the type of visual feedback from the human participant: for example, if the agent makes a deictic pointing gesture, the user is more likely to look at the location the ECA is pointing to. In human-human dialogues, a critical gestural feature was where the speaker looked. This demonstrates that shallow, very simple features can be reliably useful for predicting nonverbal gestures.
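Such an agent-gesture context cue can be encoded alongside the lexical and prosodic features; the encoding below is hypothetical, not from the cited systems:

```python
# Hypothetical encoding of the agent's synthesized gesture as a context
# feature, concatenated with the lexical and prosodic features above.
AGENT_GESTURES = ["none", "deictic-point", "beat", "iconic"]

def gesture_context_features(agent_gesture: str) -> list[float]:
    """One-hot indicator of the gesture the ECA is currently producing."""
    return [1.0 if g == agent_gesture else 0.0 for g in AGENT_GESTURES]

print(gesture_context_features("deictic-point"))  # [0.0, 1.0, 0.0, 0.0]
```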
The shallow features used in previous work were easy to calculate
or annotate and were used for both ECA-human and human-human
interactions. However, this very simplicity means the features capture only a small part of the information available in linguistic and gestural behavior.
4.3 Modeling latent dynamics
One of the key challenges in modeling individual and interpersonal dynamics is to automatically learn the synchrony and complementarity within a person's verbal and nonverbal behaviors, and between people. A new computational model called the Latent-Dynamic CRF (see Figure 6) was developed to incorporate hidden state variables that model the sub-structure of a class sequence and learn dynamics between class labels (Morency et al., 2007). This is a significant departure from previous approaches, which examined individual modalities in isolation and ignored the synergy between speech and gestures.
The task of the Latent-Dynamic CRF model is to learn a mapping between a sequence of observations x = {x_1, x_2, ..., x_m} and a sequence of labels y = {y_1, y_2, ..., y_m}. Each y_j is a class label for the j-th frame of a video sequence and is a member of a set Y of possible class labels, for example, Y = {backchannel, other-gesture}. Each observation x_j is represented by a feature vector computed from the j-th frame, such as the lexical, prosodic, and gestural features described above.
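The following minimal numpy sketch shows the structural idea: each class label owns a disjoint set of hidden states, a linear chain runs over the hidden states, and label probabilities are obtained by marginalizing over hidden-state paths. Feature functions, training, and regularization are omitted, and the parameter shapes are illustrative rather than the exact formulation of Morency et al. (2007):

```python
import numpy as np
from scipy.special import logsumexp

LABELS = ["backchannel", "other-gesture"]
STATES_PER_LABEL = 3                    # hidden sub-structure per label
H = len(LABELS) * STATES_PER_LABEL      # total number of hidden states

def frame_label_marginals(obs, W, T):
    """P(y_j | x) for each frame j via forward-backward over hidden states.

    obs: (m, d) observation feature vectors x_1..x_m
    W:   (H, d) per-state emission weights
    T:   (H, H) transition weights between hidden states
    """
    m = obs.shape[0]
    log_phi = obs @ W.T                           # (m, H) frame-state scores
    alpha = np.zeros((m, H))                      # forward log-messages
    beta = np.zeros((m, H))                       # backward log-messages
    alpha[0] = log_phi[0]
    for j in range(1, m):
        alpha[j] = log_phi[j] + logsumexp(alpha[j - 1][:, None] + T, axis=0)
    for j in range(m - 2, -1, -1):
        beta[j] = logsumexp(T + (log_phi[j + 1] + beta[j + 1])[None, :], axis=1)
    log_marg = alpha + beta                       # unnormalized log P(h_j | x)
    log_marg -= logsumexp(log_marg, axis=1, keepdims=True)
    state_marg = np.exp(log_marg)                 # (m, H) state marginals
    # States 0..2 belong to label 0, states 3..5 to label 1: summing each
    # label's disjoint block of hidden states yields the label marginals.
    return state_marg.reshape(m, len(LABELS), STATES_PER_LABEL).sum(axis=2)

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 8))                      # 20 frames, 8-dim features
probs = frame_label_marginals(x, rng.normal(size=(H, 8)),
                              rng.normal(size=(H, H)))
print(probs.shape, probs[0])                      # (20, 2); rows sum to 1
```

Because each label owns a disjoint set of hidden states, the per-frame label probability reduces to a sum of that label's hidden-state marginals, which keeps inference exact and efficient while still letting the hidden layer capture the sub-structure of each gesture class.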