tion (Aizawa & Huang, 1995), audio-visual speech recognition (Stork & Hennecke,
1996), and expression recognition (Cohen et al., 2002). Simple approaches utilize
only low-level image features (Goto, Kshirsagar & Thalmann, 2001). However,
low-level image features alone are not robust enough, because tracking errors
accumulate over time. High-level knowledge of facial deformation must therefore
be used to handle the error-accumulation problem by imposing constraints on the
possible deformed facial shapes.
possible deformed facial shapes. For 3D facial motion tracking, people have used
various 3D deformable model spaces, such as a 3D parametric model (DeCarlo,
1998), MPEG-4 FAP-based B-Spline surface (Eisert, Wiegand & Girod, 2000)
and FACS-based models (Tao, 1998). These models, however, are usually
manually defined, which cannot capture the real motion characteristics of facial
features well. Therefore, some researchers have recently proposed to train
facial motion subspace models from real facial motion data (Basu, Oliver &
Pentland, 1999; Reveret & Essa, 2001).
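To make the subspace idea concrete, the sketch below learns a linear facial
motion subspace with PCA over tracked feature-point shapes and then projects a
noisy tracked shape back onto that subspace, which is one simple way to impose
the shape constraints discussed above. This is a minimal sketch: the data,
dimensions, and function names are illustrative assumptions, not taken from the
cited works.

    import numpy as np

    # Hypothetical training data: N frames of tracked facial shapes, each a
    # flattened vector of K 3D feature points (dimensions are illustrative).
    def learn_motion_subspace(shapes, num_modes=5):
        """Learn a linear deformation subspace via PCA over tracked shapes."""
        mean_shape = shapes.mean(axis=0)
        # SVD of the mean-centered data yields the principal deformation modes.
        _, _, vt = np.linalg.svd(shapes - mean_shape, full_matrices=False)
        return mean_shape, vt[:num_modes]       # basis: (num_modes, 3K)

    def constrain_shape(noisy_shape, mean_shape, basis):
        """Project a tracked shape onto the subspace, suppressing deformations
        the model has never seen and thus limiting error accumulation."""
        coeffs = basis @ (noisy_shape - mean_shape)
        return mean_shape + basis.T @ coeffs

    # Usage with synthetic stand-in data for real tracked motion:
    rng = np.random.default_rng(0)
    shapes = rng.normal(size=(200, 3 * 30))     # 200 frames, 30 points
    mean_shape, basis = learn_motion_subspace(shapes)
    noisy = shapes[0] + rng.normal(scale=0.1, size=shapes[0].shape)
    constrained = constrain_shape(noisy, mean_shape, basis)

Restricting each tracked frame to the few strongest deformation modes is the
essence of the trained-subspace approach: implausible shapes are filtered out
before tracking errors can compound.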
Facial Motion Synthesis
Based on spatial and temporal modeling of facial deformation, facial motion is
usually synthesized according to semantic input, such as text script (Waters &
Levergood, 1993), actor performance (Guenter et al., 1998), or speech (Brand,
1999; Morishima & Harashima, 1991). In this chapter, we focus on real-time
speech-driven face animation.
A synthetic talking face is useful for multi-modal human-computer interaction,
such as e-commerce (Pandzic, Ostermann & Millen, 1999) and computer-aided
education (Cole et al., 1999). To generate facial shapes directly from audio,
the core issue is the audio-to-visual mapping, which converts audio information
into visual information about facial shapes. HMM-based methods (Brand, 1999)
utilize long-term contextual information to generate a smooth facial deformation
trajectory, but they can only be used in off-line scenarios. For real-time
mapping, various methods have been proposed, such as vector quantization (VQ)
(Morishima & Harashima, 1991), the Gaussian mixture model (GMM) (Rao & Chen,
1996), and artificial neural networks (ANNs) (Morishima & Harashima, 1991;
Goto, Kshirsagar & Thalmann, 2001); a GMM-based mapping is sketched below. To
exploit short-time contextual information for a smoother result, others have
proposed concatenating audio features over a short time window (Massaro et al.,
1999) or using a time-delay neural network (TDNN) (Lavagetto, 1995); a
windowed-network sketch follows the GMM example.
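As a sketch of the GMM-based frame-wise mapping mentioned above (in the spirit
of Rao & Chen, 1996), one can fit a joint Gaussian mixture over concatenated
audio-visual training vectors and map each new audio frame to the conditional
mean of the visual part. Feature dimensions, component counts, and variable
names here are illustrative assumptions, not the cited authors' exact setup.

    import numpy as np
    from scipy.stats import multivariate_normal
    from sklearn.mixture import GaussianMixture

    DA, DV = 12, 6   # audio / visual feature dimensions (illustrative)

    def fit_joint_gmm(audio, visual, n_components=8):
        """Fit a GMM over concatenated [audio, visual] training vectors."""
        joint = np.hstack([audio, visual])
        return GaussianMixture(n_components=n_components,
                               covariance_type="full").fit(joint)

    def audio_to_visual(gmm, a):
        """Frame-wise mapping: E[visual | audio] under the joint GMM."""
        cond_means = np.zeros((gmm.n_components, DV))
        resp = np.zeros(gmm.n_components)
        for k in range(gmm.n_components):
            mu_a, mu_v = gmm.means_[k, :DA], gmm.means_[k, DA:]
            S = gmm.covariances_[k]
            S_aa, S_va = S[:DA, :DA], S[DA:, :DA]
            # Per-component linear regression of visual on audio.
            cond_means[k] = mu_v + S_va @ np.linalg.solve(S_aa, a - mu_a)
            # Responsibility of component k for this audio frame.
            resp[k] = gmm.weights_[k] * multivariate_normal.pdf(a, mu_a, S_aa)
        resp /= resp.sum()
        return resp @ cond_means     # responsibility-weighted estimate

    # Usage with synthetic stand-in data:
    rng = np.random.default_rng(1)
    audio = rng.normal(size=(500, DA))
    visual = rng.normal(size=(500, DV))
    gmm = fit_joint_gmm(audio, visual)
    mouth_params = audio_to_visual(gmm, audio[0])

Because each frame is mapped independently, this kind of estimator runs in real
time, at the cost of ignoring the longer-range context that HMM-based methods
exploit.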
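Similarly, a minimal sketch of the short-time-window idea: concatenate each
audio frame with a few neighbouring frames and train an ordinary feedforward
network to regress mouth-shape parameters frame by frame. The window size,
dimensions, and the off-the-shelf regressor are assumptions for illustration,
not the exact networks of the cited papers.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    WIN = 2          # frames of context on each side (5-frame window total)
    DA, DV = 12, 6   # audio / visual feature dimensions (illustrative)

    def make_windows(audio):
        """Stack each frame with its +/- WIN neighbours into one input."""
        padded = np.pad(audio, ((WIN, WIN), (0, 0)), mode="edge")
        return np.stack([padded[i:i + 2 * WIN + 1].ravel()
                         for i in range(len(audio))])

    rng = np.random.default_rng(2)
    audio = rng.normal(size=(500, DA))    # stand-in acoustic features
    visual = rng.normal(size=(500, DV))   # stand-in mouth-shape parameters

    net = MLPRegressor(hidden_layer_sizes=(64,), max_iter=300)
    net.fit(make_windows(audio), visual)

    # Real-time use: one forward pass per incoming frame's window.
    predicted_mouth = net.predict(make_windows(audio)[:1])

The short symmetric window gives the network limited lookahead and history, so
its output varies more smoothly across frames than a strictly frame-by-frame
mapping, while still running with only a few frames of latency.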