Huang, 2002). We videotape a subject who is asked to calmly read three
sentences with three different facial expressions: (1) neutral, (2) smile, and (3)
sad, respectively. The three sentences are: (1) “It is normal,” (2) “It is good,” and
(3) “It is bad.” The associated information is: (1) neutral, (2) positive, and (3)
negative, respectively. The audio tracks are used to generate three sets of face
animation sequences; all three audio tracks are used in each set. The three sets
are generated with a neutral, a smiling, and a sad expression, respectively. In our
experiments, the facial deformation due to speech and the deformation due to
expression are linearly combined. Sixteen untrained human subjects, none of whom
has used our system before, participate in the experiments.
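The text states only that the speech-driven and expression-driven deformations are combined linearly, without giving details. The following minimal Python sketch illustrates one plausible reading, in which per-vertex displacements from the speech (viseme) track and from the expression are added to the neutral mesh; the function name, array shapes, and blending weights are illustrative assumptions, not values from the paper.

import numpy as np

def combine_deformations(neutral, speech_delta, expression_delta,
                         w_speech=1.0, w_expression=1.0):
    """Linearly combine speech- and expression-driven deformations.

    neutral          -- (N, 3) vertex positions of the neutral face mesh
    speech_delta     -- (N, 3) per-vertex displacement from the speech track
    expression_delta -- (N, 3) per-vertex displacement from the expression
    w_speech, w_expression -- hypothetical blending weights (assumed,
                              not taken from the paper)
    """
    return neutral + w_speech * speech_delta + w_expression * expression_delta

# Hypothetical usage for one frame t of the "smile" animation set:
# vertices_t = combine_deformations(neutral_mesh, speech_deltas[t], smile_delta)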
The first experiment investigates human emotion perception based on either
visual-only or audio-only stimuli. The subjects are first asked to infer the
emotional state of the speaker from the animation sequences presented without
audio. The emotion inference results, in terms of the number of subjects, are
shown in Table 1. As the table shows, the effectiveness of the synthetic talking
face is comparable to that of the real face. The subjects are then asked to listen
to the audio alone and decide the emotional state of the speaker. Note that the
audio tracks are produced without emotion. These results, again in terms of the
number of subjects, are also shown in Table 1.
Table 1. Emotion inference based on visual-only or audio-only stimuli. “S”
column: Synthetic face; “R” column: Real face.

              Facial Expression                         Audio
           Neutral       Smile         Sad
Emotion    S     R       S     R       S     R       1     2     3
Neutral    16    16      4     3       2     0       16    6     7
Happy      0     0       12    13      0     0       0     10    0
Sad        0     0       0     0       14    16      0     0     9
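As a quick arithmetic check, the counts in Table 1 for the visual-only condition can be converted into recognition rates out of the sixteen subjects. The short Python snippet below does this; the dictionary layout and variable names are our own framing of the table, not part of the original system.

# Table 1, visual-only condition: shown expression -> {inferred emotion: (S, R)}
table1_visual = {
    "neutral": {"neutral": (16, 16), "happy": (0, 0),   "sad": (0, 0)},
    "smile":   {"neutral": (4, 3),   "happy": (12, 13), "sad": (0, 0)},
    "sad":     {"neutral": (2, 0),   "happy": (0, 0),   "sad": (14, 16)},
}
# Emotion each facial expression is intended to convey.
intended = {"neutral": "neutral", "smile": "happy", "sad": "sad"}

for expression, counts in table1_visual.items():
    s, r = counts[intended[expression]]
    print(f"{expression:>7}: synthetic {s}/16 ({s/16:.0%}), real {r}/16 ({r/16:.0%})")
# neutral: 16/16 (100%) vs 16/16 (100%); smile: 12/16 (75%) vs 13/16 (81%);
# sad: 14/16 (88%) vs 16/16 (100%)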
Table 2. Emotion inference results agreed with facial expressions. The
inference is based on both audio and visual stimuli. “S” column: Synthetic
face; “R” column: Real face.

                          Facial Expression
                         Smile          Sad
Audio-visual relation    S     R        S     R
Same                     15    16       16    16
Opposite                 2     3        10    12