approaches using only one neural network for all audio features [Lavagetto, 1995, Massaro et al., 1999], our local ANN mapping (i.e., one neural network for each audio feature cluster) is more efficient because each ANN is much simpler and can therefore be trained with much less effort for a given set of training data. More generally, speech driven animation can be used in speech and language education [Cole et al., 1999], as a speech understanding aid for noisy environments and hard-of-hearing people, and as a rehabilitation tool for facial motion disorders.
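The local mapping idea above can be illustrated with a minimal sketch, not the authors' implementation: an incoming audio feature vector is assigned to its nearest cluster, and that cluster's own small network maps it to facial motion parameters. The cluster centroids, the linear per-cluster "networks", and all dimensions below are hypothetical placeholders.

```python
import numpy as np

# Hypothetical sketch of local ANN mapping: one small network per audio
# feature cluster, instead of a single large network for all features.
# Centroids and weights are random stand-ins for trained parameters.
rng = np.random.default_rng(0)
n_clusters, audio_dim, motion_dim = 4, 12, 7

centroids = rng.normal(size=(n_clusters, audio_dim))          # cluster centers
weights = rng.normal(size=(n_clusters, motion_dim, audio_dim))  # per-cluster net
biases = rng.normal(size=(n_clusters, motion_dim))

def map_audio_to_motion(audio_vec):
    """Pick the nearest cluster, then apply that cluster's small network."""
    k = int(np.argmin(np.linalg.norm(centroids - audio_vec, axis=1)))
    return weights[k] @ audio_vec + biases[k]

motion = map_audio_to_motion(rng.normal(size=audio_dim))
```

Because each cluster covers only a small region of the audio feature space, each per-cluster mapping stays simple, which is the efficiency argument made above.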
5.2.4 Human emotion perception study
The synthetic talking face, which is used to convey visual cues to humans, can be evaluated by human perception studies. Here, we describe our experiments, which compare the influence of the synthetic talking face on human emotion perception with that of the real face. We performed similar experiments for 2D MU-based speech driven animation [Hong et al., 2002]. The experimental results can help the user decide how to use the synthetic talking face to deliver the intended emotions.
We videotape a speaking subject who is asked to calmly read three sentences with three facial expressions: (1) neutral, (2) smile, and (3) sad, respectively. Hence, the audio tracks do not convey any emotional information. The contents of the three sentences are: (1) “It is normal.”; (2) “It is good.”; and (3) “It is bad.”. The associated information is: (1) neutral; (2) positive; and (3) negative.
The audio tracks are used to generate three sets of face animation sequences. All three audio tracks are used in each set of animation sequences. The first set is generated without expression, the second set with a smile expression, and the third set with a sad expression. The facial deformation due to speech and that due to expression are linearly combined in our experiments. Sixteen untrained human subjects, who have never used our system before, participate in the experiments.
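The linear combination of speech- and expression-driven deformation can be sketched as follows. This is an assumed illustration, not the authors' code: per-vertex displacements from the speech channel and the expression channel are simply added, with a hypothetical scale factor `alpha` controlling expression intensity.

```python
import numpy as np

def combine_deformation(speech_disp, expression_disp, alpha=1.0):
    """Linearly combine speech- and expression-driven vertex displacements.

    speech_disp, expression_disp: (n_vertices, 3) displacement arrays.
    alpha: assumed scale on the expression component (not from the source).
    """
    return speech_disp + alpha * expression_disp

# Toy example with three vertices in 3D.
speech = np.array([[0.0, 0.1, 0.0],
                   [0.2, 0.0, 0.0],
                   [0.0, 0.0, 0.0]])
smile = np.array([[0.0, 0.0, 0.0],
                  [0.1, 0.1, 0.0],
                  [0.0, 0.2, 0.0]])

combined = combine_deformation(speech, smile)  # elementwise sum
```

Because the combination is linear, the speech and expression channels can be generated independently and merged per frame at render time.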
The first experiment investigates human emotion perception based on either
the visual stimuli alone or the audio stimuli alone. The subjects are first asked to
recognize the expressions of both the real face and the synthetic talking face and
infer their emotional states based on the animation sequences without audio.
All subjects correctly recognized the expressions of both the synthetic face and
the real face. Therefore, our synthetic talking face is capable of accurately delivering facial expression information. The emotional inference results, in terms of the number of subjects, are shown in Table 5.2. The “S” columns in Table 5.2, as well as in Tables 5.4, 5.5, and 5.6, show the results using the synthetic talking face. The “R” columns show the results using the real face.
As shown, the effectiveness of the synthetic talking face is comparable to that of the real face. The subjects are then asked to listen to the audio and infer the emotional state of the speaker. Each subject listens to each audio track only once.