Figure 5.4. The architecture of the offline speech-driven talking face.
5.1 Formant features for real-time speech-driven face
Multiple acoustic features are correlated with vocal tract shape. LPC features
are among the most widely used features for speech-driven animation [Brand,
1999, Curinga et al., 1996, Morishima and Yotsukura, 1999]. In this section,
we describe a technique that uses formant frequencies as acoustic features because
they are directly related to vowel-like sounds, including vowels, diphthongs, and
semivowels. It has been observed that vowel sounds account for the major shapes
of the mouth and contribute most to the movement of the mouth. Thus,
formant analysis enables us to build a simple yet effective mapping for
speech-driven animation [Wen et al., 2001].
5.1.1 Formant analysis
The human speech production system consists of two main components: the
vocal cords and the vocal tract. The vocal cord excitation serves as the signal
source, while the vocal tract acts as a time-varying filter. Together, the
characteristics of the two components determine the output speech. The
resonance frequencies of the vocal tract tube are called formant frequencies, or
simply formants. The formant frequencies depend on the shape of the vocal tract,
and each shape is characterized by a set of formants [Rabiner and Schafer, 1978].
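As a minimal sketch of the source-filter view described above, the following Python code models the vocal tract as an all-pole filter whose conjugate pole pairs set the resonances, excites it with an impulse train standing in for the vocal cord source, and reads the formants back from the pole angles. The function names, the formant and bandwidth values, the sampling rate, and the pitch are illustrative assumptions and are not taken from the system described here.

```python
import numpy as np


def vocal_tract_filter(formants_hz, bandwidths_hz, fs):
    """Return denominator coefficients a[0..p] of an all-pole filter 1/A(z).

    Each formant contributes one conjugate pole pair (hypothetical helper).
    """
    a = np.array([1.0])
    for f, bw in zip(formants_hz, bandwidths_hz):
        r = np.exp(-np.pi * bw / fs)      # pole radius from the bandwidth
        theta = 2.0 * np.pi * f / fs      # pole angle from the frequency
        # Second-order section: 1 - 2 r cos(theta) z^-1 + r^2 z^-2
        a = np.convolve(a, [1.0, -2.0 * r * np.cos(theta), r * r])
    return a


def synthesize(a, f0, fs, duration):
    """Pass an impulse train (the source) through the filter 1/A(z)."""
    n = int(fs * duration)
    x = np.zeros(n)
    x[:: int(fs / f0)] = 1.0              # one pulse per pitch period
    y = np.zeros(n)
    for i in range(n):
        acc = x[i]
        for k in range(1, len(a)):
            if i - k >= 0:
                acc -= a[k] * y[i - k]    # recursive (all-pole) part
        y[i] = acc
    return y


def formants_from_filter(a, fs):
    """Recover resonance frequencies in Hz from the poles of 1/A(z)."""
    poles = np.roots(a)
    poles = poles[np.imag(poles) > 0]     # keep one pole of each conjugate pair
    return np.sort(np.angle(poles) * fs / (2.0 * np.pi))


# Assumed example values: three formants of a vowel-like sound.
fs = 10000
a = vocal_tract_filter([700.0, 1200.0, 2500.0], [80.0, 90.0, 120.0], fs)
speech = synthesize(a, f0=100.0, fs=fs, duration=0.05)
estimated = formants_from_filter(a, fs)
```

Here `estimated` holds the resonance frequencies implied by the filter coefficients alone, which illustrates why a given vocal tract shape (a given filter) is characterized by a set of formants regardless of the excitation.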
In practice, formant analysis is widely used to extract vocal tract characteristics
for speech analysis and synthesis. Many methods are available for formant
estimation. In our system, we use a method based on LPC parameters [Rabiner