tion (Aizawa & Huang, 1995), audio-visual speech recognition (Stork & Hennecke,
1996), and expression recognition (Cohen et al., 2002). Simple approaches utilize
only low-level image features (Goto, Kshirsagar & Thalmann, 2001). However,
low-level image features alone are not robust enough, because tracking errors
accumulate over time. High-level knowledge of facial deformation must therefore
be used to handle the error-accumulation problem by imposing constraints on the
possible deformed facial shapes.
possible deformed facial shapes. For 3D facial motion tracking, people have used
various 3D deformable model spaces, such as a 3D parametric model (DeCarlo,
1998), MPEG-4 FAP-based B-Spline surface (Eisert, Wiegand & Girod, 2000)
and FACS-based models (Tao, 1998). These models, however, are usually
manually defined, which cannot capture the real motion characteristics of facial
features well. Therefore, some researchers have recently proposed to train
facial motion subspace models from real facial motion data (Basu, Oliver &
Pentland, 1999; Reveret & Essa, 2001).
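To make the subspace idea concrete, the sketch below learns a linear facial
motion subspace with PCA over tracked feature-point shapes and then projects a
noisy tracked shape back onto that subspace, which is one simple way to impose
the shape constraints discussed above. This is a minimal sketch: the data,
dimensions, and function names are illustrative assumptions, not taken from the
cited works.

    import numpy as np

    # Hypothetical training data: N frames of tracked facial shapes, each a
    # flattened vector of K 3D feature points (dimensions are illustrative).
    def learn_motion_subspace(shapes, num_modes=5):
        """Learn a linear deformation subspace via PCA over tracked shapes."""
        mean_shape = shapes.mean(axis=0)
        # SVD of the mean-centered data yields the principal deformation modes.
        _, _, vt = np.linalg.svd(shapes - mean_shape, full_matrices=False)
        return mean_shape, vt[:num_modes]       # basis: (num_modes, 3K)

    def constrain_shape(noisy_shape, mean_shape, basis):
        """Project a tracked shape onto the subspace, suppressing deformations
        the model has never seen and thus limiting error accumulation."""
        coeffs = basis @ (noisy_shape - mean_shape)
        return mean_shape + basis.T @ coeffs

    # Usage with synthetic stand-in data for real tracked motion:
    rng = np.random.default_rng(0)
    shapes = rng.normal(size=(200, 3 * 30))     # 200 frames, 30 points
    mean_shape, basis = learn_motion_subspace(shapes)
    noisy = shapes[0] + rng.normal(scale=0.1, size=shapes[0].shape)
    constrained = constrain_shape(noisy, mean_shape, basis)

Restricting each tracked frame to the few strongest deformation modes is the
essence of the trained-subspace approach: implausible shapes are filtered out
before tracking errors can compound.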
Facial Motion Synthesis
Based on spatial and temporal modeling of facial deformation, facial motion is
usually synthesized according to semantic input, such as text script (Waters &
Levergood, 1993), actor performance (Guenter et al., 1998), or speech (Brand,
1999; Morishima & Harashima, 1991). In this chapter, we focus on real-time
speech-driven face animation.
A synthetic talking face is useful for multi-modal human-computer interaction,
such as e-commerce (Pandzic, Ostermann & Millen, 1999) and computer-aided
education (Cole et al., 1999). To generate facial shapes directly from audio,
the core issue is the audio-to-visual mapping, which converts audio information
into visual information about facial shapes. HMM-based methods (Brand, 1999)
utilize long-term contextual information to generate a smooth facial deformation
trajectory, but they can only be used in off-line scenarios. For real-time
mapping, various methods have been proposed, such as vector quantization (VQ)
(Morishima & Harashima, 1991), the Gaussian mixture model (GMM) (Rao & Chen,
1996), and artificial neural networks (ANNs) (Morishima & Harashima, 1991;
Goto, Kshirsagar & Thalmann, 2001); a GMM-based mapping is sketched below. To
exploit short-time contextual information for a smoother result, others have
proposed concatenating audio features over a short time window (Massaro et al.,
1999) or using a time-delay neural network (TDNN) (Lavagetto, 1995); a
windowed-network sketch follows the GMM example.
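As a sketch of the GMM-based frame-wise mapping mentioned above (in the spirit
of Rao & Chen, 1996), one can fit a joint Gaussian mixture over concatenated
audio-visual training vectors and map each new audio frame to the conditional
mean of the visual part. Feature dimensions, component counts, and variable
names here are illustrative assumptions, not the cited authors' exact setup.

    import numpy as np
    from scipy.stats import multivariate_normal
    from sklearn.mixture import GaussianMixture

    DA, DV = 12, 6   # audio / visual feature dimensions (illustrative)

    def fit_joint_gmm(audio, visual, n_components=8):
        """Fit a GMM over concatenated [audio, visual] training vectors."""
        joint = np.hstack([audio, visual])
        return GaussianMixture(n_components=n_components,
                               covariance_type="full").fit(joint)

    def audio_to_visual(gmm, a):
        """Frame-wise mapping: E[visual | audio] under the joint GMM."""
        cond_means = np.zeros((gmm.n_components, DV))
        resp = np.zeros(gmm.n_components)
        for k in range(gmm.n_components):
            mu_a, mu_v = gmm.means_[k, :DA], gmm.means_[k, DA:]
            S = gmm.covariances_[k]
            S_aa, S_va = S[:DA, :DA], S[DA:, :DA]
            # Per-component linear regression of visual on audio.
            cond_means[k] = mu_v + S_va @ np.linalg.solve(S_aa, a - mu_a)
            # Responsibility of component k for this audio frame.
            resp[k] = gmm.weights_[k] * multivariate_normal.pdf(a, mu_a, S_aa)
        resp /= resp.sum()
        return resp @ cond_means     # responsibility-weighted estimate

    # Usage with synthetic stand-in data:
    rng = np.random.default_rng(1)
    audio = rng.normal(size=(500, DA))
    visual = rng.normal(size=(500, DV))
    gmm = fit_joint_gmm(audio, visual)
    mouth_params = audio_to_visual(gmm, audio[0])

Because each frame is mapped independently, this kind of estimator runs in real
time, at the cost of ignoring the longer-range context that HMM-based methods
exploit.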
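Similarly, a minimal sketch of the short-time-window idea: concatenate each
audio frame with a few neighbouring frames and train an ordinary feedforward
network to regress mouth-shape parameters frame by frame. The window size,
dimensions, and the off-the-shelf regressor are assumptions for illustration,
not the exact networks of the cited papers.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    WIN = 2          # frames of context on each side (5-frame window total)
    DA, DV = 12, 6   # audio / visual feature dimensions (illustrative)

    def make_windows(audio):
        """Stack each frame with its +/- WIN neighbours into one input."""
        padded = np.pad(audio, ((WIN, WIN), (0, 0)), mode="edge")
        return np.stack([padded[i:i + 2 * WIN + 1].ravel()
                         for i in range(len(audio))])

    rng = np.random.default_rng(2)
    audio = rng.normal(size=(500, DA))    # stand-in acoustic features
    visual = rng.normal(size=(500, DV))   # stand-in mouth-shape parameters

    net = MLPRegressor(hidden_layer_sizes=(64,), max_iter=300)
    net.fit(make_windows(audio), visual)

    # Real-time use: one forward pass per incoming frame's window.
    predicted_mouth = net.predict(make_windows(audio)[:1])

The short symmetric window gives the network limited lookahead and history, so
its output varies more smoothly across frames than a strictly frame-by-frame
mapping, while still running with only a few frames of latency.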