GEOMETRIC FACIAL MOTION SYNTHESIS - 3D Face Processing: Modeling, Analysis and Synthesis

of vowels cluster in a stable triangular subspace (the “vowel triangle”). For each vowel, the formants lie in a certain region that is distinguishable from those of other vowels. Along a smooth trajectory in the so-called “vowel triangle” (Figure 5.6), the mouth shape changes smoothly. Among other vowel-like sounds, diphthongs can be modeled as transitions between vowels; semivowels are transitions between vowels and adjacent phonemes. Thus, they are (or partly are, in the case of semivowels) trajectories in the “vowel triangle” [Rabiner and Shafer, 1978].

Figure 5.6. “Vowel Triangle” in the system, circles correspond to vowels [Rabiner and Shafer, 1978].
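The idea that a diphthong is a trajectory between two vowels can be sketched as a path between two points in the (F1, F2) formant plane. The sketch below is only an illustration: the formant values for /a/ and /i/ are rough assumed figures, and the straight-line path and the name `diphthong_trajectory` are hypothetical choices, not part of the described system.

```python
from typing import List, Tuple

Formants = Tuple[float, float]  # (F1, F2) in Hz

# Illustrative (assumed) formant positions; real values depend on the talker.
VOWEL_FORMANTS = {
    "a": (730.0, 1090.0),
    "i": (270.0, 2290.0),
}


def diphthong_trajectory(start: Formants, end: Formants, steps: int) -> List[Formants]:
    """Sample `steps` points on a straight line from `start` to `end`."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    points = []
    for k in range(steps):
        t = k / (steps - 1)  # 0.0 at the first vowel, 1.0 at the second
        f1 = (1.0 - t) * start[0] + t * end[0]
        f2 = (1.0 - t) * start[1] + t * end[1]
        points.append((f1, f2))
    return points


# A diphthong such as /ai/ modeled as a path from /a/ to /i/.
path = diphthong_trajectory(VOWEL_FORMANTS["a"], VOWEL_FORMANTS["i"], steps=5)
```

Each sampled point is a position in the “vowel triangle”, so a smooth path of this kind corresponds to a smoothly changing mouth shape.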

Based on those facts, we can define the mouth shapes corresponding to vowel-like sounds as a manifold over the “vowel triangle”. The manifold could be learned from audio/visual data of recorded human speech. In our approach, we choose a much simpler alternative, which makes use of the phoneme-viseme correspondence. Visemes of vowels, which are widely used for facial animation, can be seen as observations of the manifold. The mouth shapes elsewhere on the manifold can then be approximated by interpolation techniques. In the speaker-independent case, the “vowel triangle” is enlarged and there is overlap between different vowel regions [Rabiner and Shafer, 1978]. However, much of each vowel region is still distinguishable from the others, so that each region can be roughly related to a unique mouth shape. Thus we can expect the mouth shape manifold assumption to still produce reasonable mouth shapes, although less natural ones than in the speaker-dependent case. In our system, we use the averaged values of measured formant frequencies of vowels for a wide range of talkers [Rabiner and Shafer, 1978]. For sounds other than vowel-like sounds, the proposed mapping is inadequate. Energy envelope modulation is used as a heuristic to deal with that problem in the system shown in Figure 5.5.
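The mapping described above can be sketched as follows: vowel visemes act as known samples of the mouth shape manifold, and the mouth shape at any other formant position is interpolated from them. The sketch uses inverse-distance weighting, which is only one possible choice, since the text does not say which interpolation technique is used. The formant positions, the two-parameter mouth shape (opening, width), the viseme values, and the energy scaling rule are all assumptions made for illustration.

```python
import math
from typing import Dict, Tuple

Formants = Tuple[float, float]    # (F1, F2) in Hz
MouthShape = Tuple[float, float]  # (opening, width) in [0, 1]; hypothetical parameters

# Assumed vowel anchors: formant position -> viseme mouth shape.
VISEMES: Dict[str, Tuple[Formants, MouthShape]] = {
    "i": ((270.0, 2290.0), (0.2, 1.0)),
    "a": ((730.0, 1090.0), (1.0, 0.5)),
    "u": ((300.0, 870.0), (0.2, 0.0)),
}


def interpolate_mouth_shape(f: Formants, power: float = 2.0) -> MouthShape:
    """Inverse-distance-weighted blend of the vowel visemes at formant point `f`."""
    weights = []
    for anchor, shape in VISEMES.values():
        d = math.hypot(f[0] - anchor[0], f[1] - anchor[1])
        if d == 0.0:
            return shape  # exactly on a vowel: return its viseme unchanged
        weights.append((1.0 / d ** power, shape))
    total = sum(w for w, _ in weights)
    opening = sum(w * s[0] for w, s in weights) / total
    width = sum(w * s[1] for w, s in weights) / total
    return (opening, width)


def modulate_by_energy(shape: MouthShape, energy: float, max_energy: float) -> MouthShape:
    """Scale the mouth opening by the normalized energy envelope (assumed heuristic)."""
    if max_energy <= 0.0:
        gain = 0.0
    else:
        gain = min(max(energy / max_energy, 0.0), 1.0)  # clamp to [0, 1]
    return (shape[0] * gain, shape[1])
```

Plain Euclidean distance in Hz treats F1 and F2 on the same scale, which is a simplification; a practical system would likely normalize the two axes or use a perceptual frequency scale before weighting.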

In the implemented real-time system, the average delay between a speech input frame and the generated animation frame is less than 100 ms. The generated
