nonverbal behavior. Figure 5 presents an example of contextual
prediction for the listener's backchannel.
The goal of a prediction model is to create online predictions of
human nonverbal behaviors based on external contextual information.
The prediction model automatically learns which contextual features
are important and how they affect the timing of nonverbal behaviors.
This goal is achieved by using a machine learning approach wherein
a sequential probabilistic model is trained using a database of human
interactions. A sequential probabilistic model takes as input a sequence
of observation features (e.g., the speaker's features) and returns a
sequence of probabilities (e.g., of the listener's backchannel). Some of
the most popular sequential models are the Hidden Markov Model
(HMM) (Rabiner, 1989) and the Conditional Random Field (CRF)
(Lafferty et al., 2001). One main difference between these two models
is that the CRF is discriminative (i.e., it learns to differentiate the
cases where the human/agent produces a particular behavior from the
cases where it does not), while the HMM is generative (i.e., it models
only the cases where the human/agent produces the behavior, without
taking into account the cases where no such behavior occurs). The
contextual prediction framework is designed to work with any type of
sequential probabilistic model.
At the core of the approach is the idea of context, the set of external
factors which can potentially influence a person's nonverbal behavior.
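To make this concrete, here is a minimal sketch of the pipeline in
Python using the sklearn-crfsuite package; the feature names ("pause",
"pitch_slope", "head_nod"), the labels and the toy data are illustrative
assumptions, not the exact setup of the original work.

import sklearn_crfsuite

# Each training example is one interaction: a sequence of per-frame
# observation dicts (the speaker's features) paired with a sequence of
# labels (whether the listener produced a backchannel at that frame).
X_train = [
    [  # one dyadic interaction, three frames (hypothetical values)
        {"pause": True,  "pitch_slope": -0.4, "head_nod": False},
        {"pause": False, "pitch_slope":  0.1, "head_nod": False},
        {"pause": True,  "pitch_slope": -0.7, "head_nod": True},
    ],
]
y_train = [["none", "none", "backchannel"]]

# Train a discriminative sequential model: the CRF learns which
# contextual features matter and how they affect backchannel timing.
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                           max_iterations=100)
crf.fit(X_train, y_train)

# At prediction time the model returns a sequence of probabilities:
# for each frame, the marginal probability of each label.
marginals = crf.predict_marginals(X_train)
for frame in marginals[0]:
    print(frame["backchannel"])  # P(listener backchannels | speaker context)

Because the framework is model-agnostic, the CRF here could in
principle be replaced by an HMM or another sequential probabilistic
model without changing the surrounding pipeline.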
4.2 Context (shallow features)
Conceptually, context includes all verbal and nonverbal behaviors
performed by other participants (human, robot, computer or virtual
human) as well as the description of the interaction environment.
For a dyadic interaction between two humans (as shown in Figure
5), the context used to predict the listener's nonverbal behavior will
include the speaker's verbal and nonverbal behaviors, such as
head movements, eye gaze, pauses and prosodic features. What
differentiates the computational framework from “conventional”
multimodal approaches (e.g., audio-visual speech recognition) is that
the influence of other participants (and the environment) on a person's
nonverbal behavior is modeled directly instead of only modeling
signals from the same source (e.g., the listener in Figure 5).
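As a sketch of what this direct modeling looks like in practice, the
fragment below (Python; all signal names are hypothetical) assembles
the per-frame context for the listener's prediction model entirely from
the speaker's channels, rather than from the listener's own signals.

from typing import Dict, List

def build_context_sequence(
    head_nods: List[bool],          # speaker head-movement detector output
    gazes_at_listener: List[bool],  # speaker eye-gaze estimate
    pauses: List[bool],             # speaker pause detector output
    pitch_slopes: List[float],      # speaker prosodic feature, per frame
) -> List[Dict[str, object]]:
    # Align the speaker's channels frame by frame: the context is built
    # from the other participant's signals, not from the listener's own.
    return [
        {
            "head_nod": nod,
            "gaze_at_listener": gaze,
            "pause": pause,
            "pitch_slope": slope,
        }
        for nod, gaze, pause, slope in zip(
            head_nods, gazes_at_listener, pauses, pitch_slopes
        )
    ]

The resulting sequence of feature dictionaries can be fed directly to a
sequential model such as the CRF sketched earlier.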
In previous work, four types of contextual features were highlighted:
lexical features, prosody and punctuation features, timing information
and gesture displays. Such features were used to recognize human
nonverbal gestures when the robot spoke to a human, or to generate
a gesture for a virtual human given a human's verbal and nonverbal